Topic Detection and Tracking: Event Clustering as a Basis for First Story Detection

Topic Detection and Tracking (TDT) is a new research area that investigates the organization of information by event rather than by subject. In this paper, we provide an overview of the TDT research program from its inception to the third phase, which is now underway. We also discuss our approach to two of the TDT problems in detail. For event clustering (Detection), we show that classic Information Retrieval clustering techniques can be modified slightly to provide effective solutions. For first story detection, we show that similar methods provide satisfactory results, although substantial work remains. In both cases, we explore solutions that model the temporal relationship between news stories. We also investigate the use of phrase extraction to capture the who, what, when, and where contained in news.
