Multi-Paragraph Segmentation of Expository Text

This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution information to recognize the interactions of multiple simultaneous themes. Two fully-implemented versions of the algorithm are described and shown to produce segmentation that corresponds well to human judgments of the major subtopic boundaries of thirteen lengthy texts.
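The abstract does not spell out the procedure, so the following is only a rough sketch of how lexical frequency and distribution could signal a subtopic shift. It assumes one plausible approach: split the text into fixed-length token sequences, compare the word counts of the blocks on either side of each gap with cosine similarity, and treat deep valleys in the resulting similarity curve as boundaries. The function names, the `seq_len` and `block` parameters, and the `min_depth` threshold are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(tokens, seq_len=20, block=2):
    """Similarity at each gap between consecutive token sequences.

    Each gap compares the `block` sequences before it with the
    `block` sequences after it (fewer near the ends of the text).
    """
    seqs = [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]
    scores = []
    for gap in range(1, len(seqs)):
        left = Counter(t for s in seqs[max(0, gap - block):gap] for t in s)
        right = Counter(t for s in seqs[gap:gap + block] for t in s)
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """Depth of each valley: distance up to the nearest peak on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depths.append((scores[left] - s) + (scores[right] - s))
    return depths


def boundaries(tokens, seq_len=20, block=2, min_depth=0.5):
    """Token offsets of gaps whose depth reaches `min_depth`.

    `min_depth` is an illustrative fixed threshold, not the paper's criterion.
    """
    depths = depth_scores(gap_scores(tokens, seq_len, block))
    return [(i + 1) * seq_len for i, d in enumerate(depths) if d >= min_depth]
```

For example, a toy input whose vocabulary changes completely halfway through, such as `["cat", "dog"] * 20 + ["sun", "moon"] * 20` with `seq_len=10`, yields a single boundary at token offset 40. Real expository text would need tokenization, stopword removal, and a cutoff tuned to the data, none of which this sketch attempts.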
