Representations for multi-document event clustering

We study several techniques for representing, fusing and comparing the content of news documents. As underlying models we consider the vector space model (both over raw terms and in a latent semantic analysis setting) and probabilistic topic models based on latent Dirichlet allocation. Content terms can be classified as topical terms or named entities, which yields several models for content fusion and comparison. All methods used are completely unsupervised. We find that simple methods can still outperform current state-of-the-art techniques.
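To make the idea of comparing and fusing two kinds of content terms concrete, here is a minimal sketch in the term-based vector space setting. It is an illustration under stated assumptions, not the method evaluated in this work: the linear fusion rule, the `entity_weight` parameter, the function names and the toy documents are all hypothetical, and raw term counts stand in for whatever term weighting is actually used.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def fused_similarity(doc_a, doc_b, entity_weight=0.5):
    """Weighted sum of topical-term and named-entity cosine similarities.

    Each document is a dict with 'topical' and 'entities' term lists.
    The linear fusion and entity_weight are assumptions made for this sketch.
    """
    topical = cosine(Counter(doc_a["topical"]), Counter(doc_b["topical"]))
    entities = cosine(Counter(doc_a["entities"]), Counter(doc_b["entities"]))
    return (1 - entity_weight) * topical + entity_weight * entities


# Toy documents (invented for illustration only).
d1 = {"topical": ["earthquake", "rescue", "damage"], "entities": ["Haiti", "Red Cross"]}
d2 = {"topical": ["earthquake", "aid", "damage"], "entities": ["Haiti", "UN"]}
d3 = {"topical": ["election", "vote"], "entities": ["Haiti"]}

same_event = fused_similarity(d1, d2)
different_event = fused_similarity(d1, d3)
```

In this toy setup `d1` and `d3` share a named entity but no topical terms, so the fused score for that pair stays below the score for `d1` and `d2`, which overlap on both term types. A pairwise score of this kind could then feed an unsupervised clustering step.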
