Paper Information - Statistical Models for Topic Segmentation

Statistical Models for Topic Segmentation

Most documents are about more than one subject, but many NLP and IR techniques implicitly assume documents have just one topic. We describe new clues that mark shifts to new topics, novel algorithms for identifying topic boundaries, and uses of such boundaries once identified. We report topic segmentation performance on several corpora as well as improvement on an IR task that benefits from good segmentation.

Jeffrey C. Reynar

[1] W. Bruce Croft, et al. Text Segmentation by Topic, 1997, ECDL.

[2] Chris Buckley, et al. Pivoted Document Length Normalization, 1996, SIGIR Forum.

[3] Graeme Hirst, et al. Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text, 1991, CL.

[4] William A. Gale, et al. Good-Turing Smoothing Without Tears, 2001.

[5] Jeffrey C. Reynar. An Automatic Method of Finding Topic Boundaries, 1994, ACL.

[6] Hideki Kozima, et al. Text Segmentation Based on Similarity between Words, 1993, ACL.

[7] Dania Egedi, et al. A Freely Available Wide Coverage Morphological Analyzer for English, 1992, COLING.

[8] G. Youmans. A New Tool for Discourse Analysis: The Vocabulary-Management Profile, 1991.

[9] Slava M. Katz. Distribution of content words and phrases in text and language modelling, 1996, Natural Language Engineering.

[10] Chris Buckley, et al. Implementation of the SMART Information Retrieval System, 1985.

[11] Jonathan Helfman. Similarity patterns in language, 1994, Proceedings of 1994 IEEE Symposium on Visual Languages.