Unsupervised Event Clustering in Multilingual News Streams

The Topic Detection and Tracking (TDT) benchmark evaluation project embraces a variety of technical challenges for information retrieval research. The TDT topic detection task concerns the unsupervised grouping of news stories according to the events they discuss. A detection system must both discover new events as incoming stories are processed and associate incoming stories with the story clusters created so far. The TNO topic detection system is based on a language modeling approach and has been evaluated on a multilingual corpus of approximately 80,000 stories from multiple news sources. To group stories, we combined a simple single-pass method, which establishes an initial clustering, with a reallocation method, which stabilizes the clusters within a certain allowed deferral period. The similarity of an incoming story S to an existing cluster C is defined as the average of the similarities of S to each story D in C. These individual similarities are computed by taking the sum of the generative probabilities P(S|D) and P(D|S), where S and D are modeled as unigram language models. Because these story language models are based on extremely sparse statistics, the word probabilities are smoothed using a background model.
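The single-pass step and the cluster similarity described above can be illustrated with a minimal Python sketch. It is not the TNO implementation. The function names, the interpolation weight `lam`, the probability `floor` for unseen words, the threshold value, and the toy stories are all assumptions made for illustration. To keep the scores comparable across story lengths, the sketch uses the per-token geometric mean of each generative probability, which is also an assumption, and it omits the reallocation step entirely.

```python
import math
from collections import Counter

def unigram(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def gen_prob(tokens, model, background, lam=0.5, floor=1e-6):
    """Per-token geometric mean of P(token | model) for a non-empty story.

    Word probabilities are smoothed by linear interpolation with the
    background model; lam and floor are assumed values, not from the source.
    """
    log_p = 0.0
    for w in tokens:
        p = lam * model.get(w, 0.0) + (1 - lam) * background.get(w, floor)
        log_p += math.log(p)
    return math.exp(log_p / len(tokens))

def story_similarity(a, b, background):
    """Sum of the two generative probabilities, one in each direction."""
    return gen_prob(a, unigram(b), background) + gen_prob(b, unigram(a), background)

def cluster_similarity(story, cluster, background):
    """Average similarity of the story to every story in the cluster."""
    return sum(story_similarity(story, s, background) for s in cluster) / len(cluster)

def single_pass(stories, background, threshold):
    """Assign each story to its most similar cluster, or start a new one."""
    clusters = []
    for story in stories:
        scores = [cluster_similarity(story, c, background) for c in clusters]
        if scores and max(scores) >= threshold:
            clusters[scores.index(max(scores))].append(story)
        else:
            clusters.append([story])
    return clusters

# Toy data (hypothetical): two stories on one event, one on another.
stories = [
    "earthquake hits city rescue teams".split(),
    "rescue teams search city after earthquake".split(),
    "election results announced president wins".split(),
]
# Here the background model is estimated from the toy stories themselves.
background = unigram([w for s in stories for w in s])
clusters = single_pass(stories, background, threshold=0.15)
print(len(clusters))
```

With this toy input and threshold, the two earthquake stories end up in one cluster and the election story starts its own. In a realistic setting the threshold would need tuning on held-out data, and the background model would come from a much larger corpus.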
