Paper Information - Topic Detection Using MFSs

When analyzing a document collection, a key piece of information is the number of distinct topics it contains. Document clustering has been used to help extract such information. However, existing clustering methods do not take the order of words within documents into account, and they usually offer no means of describing the contents of each topic cluster. In this paper, we report our investigation and results on using Maximal Frequent word Sequences (MFSs) as building blocks for identifying distinct topics. The supporting documents of the MFSs are grouped into an equivalence class and then linked to a topic cluster, with the MFSs serving as the identifier of the document cluster. We describe the original method for extracting the set of MFSs and how it can be adapted to identify topics in a textual dataset. We also demonstrate how the MFSs themselves can act as topic descriptors for the clusters. Finally, a benchmarking study against existing clustering methods, namely k-Means and the EM algorithm, shows the effectiveness of our approach for topic detection.
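The idea of grouping documents by the word sequences they share can be illustrated with a small sketch. It is not the authors' algorithm: it assumes contiguous word sequences only (no gaps), whitespace tokenization, and a brute-force level-wise search, and it treats an "equivalence class" as the set of MFSs with identical supporting documents. The function names, the `min_support` parameter, and the sample documents are hypothetical.

```python
from collections import defaultdict


def frequent_sequences(docs, min_support):
    """Map each contiguous word sequence found in at least
    `min_support` documents to the set of supporting document ids.
    Assumption: sequences are contiguous and tokens are split on whitespace."""
    tokenized = [d.lower().split() for d in docs]
    frequent = {}
    n = 1
    while True:
        support = defaultdict(set)
        for doc_id, words in enumerate(tokenized):
            for i in range(len(words) - n + 1):
                support[tuple(words[i:i + n])].add(doc_id)
        level = {s: ids for s, ids in support.items() if len(ids) >= min_support}
        if not level:
            break  # no frequent sequence of length n, so none of length n + 1
        frequent.update(level)
        n += 1
    return frequent


def is_contiguous_subsequence(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def maximal_frequent_sequences(docs, min_support):
    """Keep only frequent sequences not contained in a longer frequent one."""
    frequent = frequent_sequences(docs, min_support)
    return {
        s: ids
        for s, ids in frequent.items()
        if not any(s != t and is_contiguous_subsequence(s, t) for t in frequent)
    }


def group_by_support(mfss):
    """Group MFSs that share exactly the same supporting documents.
    Each key is a candidate topic cluster; its MFSs act as descriptors."""
    clusters = defaultdict(list)
    for seq, ids in mfss.items():
        clusters[frozenset(ids)].append(" ".join(seq))
    return dict(clusters)


# Hypothetical toy collection with two obvious topics.
docs = [
    "neural network training data",
    "deep neural network training",
    "stock market prices fall",
    "stock market prices rise",
]
clusters = group_by_support(maximal_frequent_sequences(docs, min_support=2))
for doc_ids, descriptors in clusters.items():
    print(sorted(doc_ids), descriptors)
```

In this toy setting each cluster key is a set of document ids and each value lists the MFSs that describe it, which mirrors the dual role of MFSs as cluster identifiers and topic descriptors. A practical implementation would likely allow gaps between words and use a more efficient search than re-scanning every document at each length.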

Han Tong Loh | Ying Liu | Lixiang Shen | Ivan Yap
