Using paraphrases for improving first story detection in news and Twitter

First story detection (FSD) involves identifying the first stories about events in a continuous stream of documents. A major problem in this task is the high degree of lexical variation in documents, which makes it very difficult to detect stories that describe the same event in different words. We suggest using paraphrases to alleviate this problem, making this the first work to use paraphrases for FSD. We show a novel way of integrating paraphrases with locality-sensitive hashing (LSH) to obtain an efficient FSD system that can scale to very large datasets. Our system achieves state-of-the-art results on the first story detection task, beating both the best supervised and unsupervised systems. To test our approach on large data, we construct a corpus of events for Twitter, consisting of 50 million documents, and show that paraphrasing is also beneficial in this domain.
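
To make the idea concrete, here is a minimal Python sketch of how paraphrase expansion might be combined with LSH for novelty scoring. It is an illustration under stated assumptions, not the system described above: the abstract does not say how paraphrases are integrated, so the toy `PARAPHRASES` table, the `expand` weighting, and the `HyperplaneLSH` class are all hypothetical, and random-hyperplane hashing with a single hash table is swapped in as one common LSH scheme for cosine similarity.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy paraphrase table (an assumption for illustration only).
PARAPHRASES = {
    "quake": ["earthquake"],
    "earthquake": ["quake"],
    "hits": ["strikes"],
    "strikes": ["hits"],
}


def expand(tokens, weight=0.5):
    """Build a term-frequency vector and add down-weighted paraphrases."""
    vec = Counter()
    for tok in tokens:
        vec[tok] += 1.0
        for alt in PARAPHRASES.get(tok, []):
            vec[alt] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class HyperplaneLSH:
    """Single-table random-hyperplane LSH over sparse term vectors."""

    def __init__(self, num_bits=8, seed=0):
        self.num_bits = num_bits
        self.seed = seed
        self.buckets = defaultdict(list)

    def _component(self, term, bit):
        # Deterministic Gaussian coordinate of hyperplane `bit` for `term`,
        # generated lazily so the vocabulary need not be known in advance.
        rng = random.Random(f"{self.seed}:{bit}:{term}")
        return rng.gauss(0.0, 1.0)

    def signature(self, vec):
        """One sign bit per hyperplane; similar vectors tend to share bits."""
        bits = []
        for bit in range(self.num_bits):
            proj = sum(w * self._component(t, bit) for t, w in vec.items())
            bits.append(proj >= 0.0)
        return tuple(bits)

    def novelty(self, vec):
        """Score = 1 - max cosine to earlier documents in the same bucket.

        The document is stored afterwards, so the stream is processed
        in a single pass. A high score suggests a first story.
        """
        candidates = self.buckets[self.signature(vec)]
        best = max((cosine(vec, c) for c in candidates), default=0.0)
        candidates.append(vec)
        return 1.0 - best


doc_a = "quake hits city".split()
doc_b = "earthquake strikes city".split()
plain_sim = cosine(Counter(doc_a), Counter(doc_b))
expanded_sim = cosine(expand(doc_a), expand(doc_b))
```

In this toy example the two documents share only one surface word, so `plain_sim` is low, while `expanded_sim` is higher because the paraphrase-expanded vectors overlap on more terms. A practical system would typically use several hash tables so that near neighbours are less likely to be missed by a single bucket lookup.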
