Paper information - Document Clustering and Text Summarization

This paper describes a text mining tool that performs two tasks: document clustering and text summarization. Both tasks have counterparts in “conventional” data mining, but the unstructured, textual nature of documents makes them considerably more difficult than those counterparts. In our system, document clustering is performed with the AutoClass data mining algorithm. Our text summarization algorithm computes a TF-ISF (term frequency – inverse sentence frequency) measure for each word, an adaptation of the conventional TF-IDF (term frequency – inverse document frequency) measure from information retrieval in which sentences play the role of documents. Sentences with high TF-ISF values are selected to form a summary of the source text. The system has been evaluated on real-world documents, and the results are satisfactory.
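The TF-ISF idea can be illustrated with a minimal sketch. The abstract does not give the exact formula or say how word values are combined into a sentence value, so everything below is an assumption: TF is taken as the count of a word within a sentence, ISF as log(N / SF(w)) where N is the number of sentences and SF(w) is the number of sentences containing w, and a sentence's score as the average TF-ISF over its distinct words. The helper names (`split_sentences`, `tokenize`, `tf_isf_scores`, `summarize`) are hypothetical, and the preprocessing is deliberately naive (no stemming, no stop-word removal).

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphabetic tokens only; no stemming or stop-word removal.
    return re.findall(r"[a-z]+", sentence.lower())


def tf_isf_scores(sentences):
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Sentence frequency: number of sentences in which each word occurs.
    sf = Counter(w for tokens in tokenized for w in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        # Assumed aggregation: mean TF-ISF over the sentence's distinct words.
        total = sum(tf[w] * math.log(n / sf[w]) for w in tf)
        scores.append(total / len(tf))
    return scores


def summarize(text, k=2):
    sentences = split_sentences(text)
    scores = tf_isf_scores(sentences)
    # Pick the k highest-scoring sentences, then restore source order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    print(summarize("The cat sat. The dog sat. The zebra galloped away.", k=1))
```

In this sketch, a word that appears in every sentence (such as "the" above) gets an ISF of zero and contributes nothing, so sentences built from words that are rare across the text score highest. A real system would likely add stop-word removal and stemming before scoring.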
