TNO at TDT2001: Language Model-Based Topic Detection

Topic detection is concerned with the unsupervised clustering of news stories over time. The TNO topic detection system is based on a language modeling approach. For the grouping of stories we combined a simple single pass method to establish an initial clustering and a reallocation method to stabilize the clusters within a certain allowed deferral period. The similarity of an incoming story to an existing cluster is defined as the average of the similarities of to each story . These individual similarities are computed by taking the sum of the generative probabilities and where and are modeled as unigram language models. Because these story language models are based on extremely sparse statistics, the word probabilities are smoothed using a background model.