Segmentation of Expository Texts by Hierarchical Agglomerative Clustering

We propose a method for segmenting expository texts based on hierarchical agglomerative clustering. The method takes paragraphs as the basic segments and uses the lexical similarity between them as the proximity test for identifying hierarchical discourse structure in the text. A linear segmentation can be induced from the identified structure by applying two simple rules. The hierarchy can, however, also be used for intelligent exploration of the text. The proposed segmentation algorithm is evaluated against an accepted linear segmentation method and shows comparable results.
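The clustering step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that only adjacent segments may be merged, that lexical similarity is the cosine between raw word-count vectors, and that a merged segment is represented by the sum of its members' counts. The function names (bag_of_words, cosine, cluster_paragraphs) are hypothetical, and the two rules for inducing a linear segmentation are not specified in this abstract, so they are not modelled here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; an assumed stand-in for the lexical features."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_paragraphs(paragraphs):
    """Repeatedly merge the most similar pair of adjacent segments.

    Returns the merge tree as nested tuples of paragraph indices,
    or None for an empty input.
    """
    if not paragraphs:
        return None
    # Each segment is (tree, word_counts); leaves are paragraph indices.
    segments = [(i, bag_of_words(p)) for i, p in enumerate(paragraphs)]
    while len(segments) > 1:
        # Assumption: only neighbouring segments are candidates for merging.
        best = max(
            range(len(segments) - 1),
            key=lambda i: cosine(segments[i][1], segments[i + 1][1]),
        )
        left, right = segments[best], segments[best + 1]
        # The merged segment pools the word counts of both members.
        merged = ((left[0], right[0]), left[1] + right[1])
        segments[best:best + 2] = [merged]
    return segments[0][0]


if __name__ == "__main__":
    toy_text = [
        "cats purr and cats sleep",
        "cats chase mice",
        "stocks fell as markets closed",
        "markets and stocks recovered",
    ]
    print(cluster_paragraphs(toy_text))
```

On this toy input the two paragraphs about cats and the two about stocks are each merged before the two groups are joined, giving the tree ((0, 1), (2, 3)). Restricting merges to neighbours keeps every node of the tree a contiguous span of text, which is what allows a linear segmentation to be read off the hierarchy.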
