Unsupervised Web Topic Detection Using A Ranked Clustering-Like Pattern Across Similarity Cascades

With the massive growth of social media on the Internet, organizing, understanding, and monitoring user-generated content (UGC) has become one of the most pressing problems in today's society. Discovering topics on the web from a huge volume of UGC is one of the promising approaches to this goal. Compared with classical topic detection and tracking in news articles, identifying topics on the web is by no means easy because data on the Internet are noisy, sparse, and less constrained. In this paper, we investigate methods from the perspective of similarity diffusion and propose a clustering-like pattern across similarity cascades (SCs). SCs are a series of subgraphs generated by truncating a similarity graph at a set of thresholds; maximal cliques in these subgraphs are then used to capture candidate topics. Finally, a topic-restricted similarity diffusion process is proposed to efficiently identify real topics from the large number of candidates. Experiments demonstrate that our approach outperforms state-of-the-art methods on three public data sets.
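The construction of similarity cascades described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy similarity values, the threshold set, the `min_size` filter, and all function names are assumptions made for illustration, and Bron-Kerbosch with pivoting is used here as one standard way to enumerate maximal cliques. The topic-restricted similarity diffusion step is not covered, since the abstract does not specify it.

```python
def truncate(sim, threshold):
    """Adjacency sets of the subgraph that keeps edges with similarity >= threshold."""
    nodes = {n for pair in sim for n in pair}
    adj = {n: set() for n in nodes}
    for (u, v), s in sim.items():
        if s >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with Bron-Kerbosch (pivoting variant)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pick the pivot with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def cascade_candidates(sim, thresholds, min_size=2):
    """Collect maximal cliques from every truncated subgraph as topic candidates."""
    candidates = set()
    for t in thresholds:
        for clique in maximal_cliques(truncate(sim, t)):
            if len(clique) >= min_size:  # assumed filter: drop singletons
                candidates.add(clique)
    return candidates


# Hypothetical pairwise similarities between five items (for example, posts).
sim = {
    ("a", "b"): 0.90, ("a", "c"): 0.80, ("b", "c"): 0.85,
    ("c", "d"): 0.40, ("d", "e"): 0.70,
}
candidates = cascade_candidates(sim, thresholds=[0.8, 0.6, 0.3])
for c in sorted(candidates, key=lambda c: sorted(c)):
    print(sorted(c))
```

Lower thresholds admit weaker edges, so the same tight group can reappear at several levels of the cascade while looser groups emerge only at the lower ones; collecting cliques into a set removes the exact repeats.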
