Update Frequency and Background Corpus Selection in Dynamic TF-IDF Models for First Story Detection

First Story Detection (FSD) requires a system to detect the very first story that mentions an event in a stream of stories. Nearest-neighbour models using traditional term vector document representations such as TF-IDF currently achieve the state of the art in FSD. Because of the online nature of the task, a dynamic term vector model that is incrementally updated during the detection process is usually adopted for FSD instead of a static model. However, very little research has investigated the selection of hyper-parameters and background corpora for a dynamic model. In this paper, we analyse how a dynamic term vector model works for FSD, and investigate the impact of different update frequencies and background corpora on FSD performance. Our results show that dynamic models with high update frequencies outperform both static models and dynamic models with low update frequencies, and that the FSD performance of dynamic models does not increase indefinitely with higher update frequencies, but instead plateaus once the update frequency passes a threshold. In addition, we demonstrate that the choice of background corpus has very limited influence on the FSD performance of dynamic models with high update frequencies.
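To make the mechanism concrete, below is a minimal sketch of nearest-neighbour FSD over a dynamic TF-IDF model. The class name, the `update_every` parameter (the update frequency), the novelty `threshold`, and the specific TF and IDF weighting are illustrative assumptions, not the paper's exact configuration; the key idea is that document frequencies start from a background corpus and are folded in incrementally as the stream is processed.

```python
import math
from collections import Counter, defaultdict

class DynamicTfIdfFSD:
    """Illustrative sketch: nearest-neighbour FSD with a dynamic TF-IDF model.

    Document frequencies are initialised from a background corpus and updated
    every `update_every` incoming stories (the update frequency under study).
    All hyper-parameter values here are hypothetical.
    """

    def __init__(self, background_docs, update_every=1, threshold=0.5):
        self.df = defaultdict(int)   # term -> document frequency
        self.n_docs = 0              # documents counted so far
        self.update_every = update_every
        self.threshold = threshold   # novelty threshold (assumed value)
        self.pending = []            # stories awaiting the next model update
        self.history = []            # TF-IDF vectors of stories seen so far
        for doc in background_docs:
            self._count(doc)

    def _count(self, tokens):
        self.n_docs += 1
        for term in set(tokens):
            self.df[term] += 1

    def _tfidf(self, tokens):
        # Log TF and smoothed IDF; L2-normalised so dot product = cosine.
        tf = Counter(tokens)
        vec = {t: (1 + math.log(c)) * math.log((self.n_docs + 1) / (self.df[t] + 1))
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    @staticmethod
    def _cosine(a, b):
        if len(b) < len(a):
            a, b = b, a
        return sum(w * b.get(t, 0.0) for t, w in a.items())

    def process(self, tokens):
        """Return (is_first_story, novelty_score) for one incoming story."""
        vec = self._tfidf(tokens)
        nearest = max((self._cosine(vec, h) for h in self.history), default=0.0)
        novelty = 1.0 - nearest
        self.history.append(vec)
        # Dynamic model: fold new stories into the DF statistics in batches,
        # so update_every=1 updates per story and larger values update less often.
        self.pending.append(tokens)
        if len(self.pending) >= self.update_every:
            for doc in self.pending:
                self._count(doc)
            self.pending.clear()
        return novelty > self.threshold, novelty
```

A static model corresponds to never folding the pending stories back into `df`; a high update frequency corresponds to a small `update_every`.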
