Statistical Models for Topic Segmentation

Most documents cover more than one subject, yet many NLP and IR techniques implicitly assume that each document has a single topic. We describe new clues that mark shifts to new topics, novel algorithms for identifying topic boundaries, and uses of such boundaries once identified. We report topic segmentation performance on several corpora, as well as an improvement on an IR task that benefits from good segmentation.
