Lexical cohesion, discourse segmentation and document summarization

Summaries derived automatically by sentence extraction are known to suffer from degraded coherence, reduced readability, and under-representation of topics. We propose a strategy for addressing these problems: generating more cohesive summaries by analyzing the lexical cohesion factors in the source document. As an initial experiment, we examine one such factor, lexical repetition, which is instrumental to the topical make-up of a text. We have developed a framework that integrates a lexical repetition-based model of discourse segmentation, capable of detecting shifts in topic, with a linguistically-aware summarizer that uses notions of salience and supports dynamically adjustable summary size. We show that even with lexical repetition alone, the resulting summaries are of comparable, and under certain conditions better, quality than those delivered by a state-of-the-art sentence-based summarizer. This is encouraging for a broader research programme that treats the recognition and use of a range of cohesive devices in text as instrumental to a wide range of content characterisation and document management tasks.
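To make the idea of repetition-based segmentation concrete, the following is a minimal sketch and not the model described in this paper. It assumes a simple block-comparison scheme: the words in a few sentences before each sentence gap are compared with the words in a few sentences after it, and a topic shift is hypothesized wherever the cosine similarity falls below a threshold. The function names, the `window` and `threshold` parameters, and the regular-expression tokenizer are all illustrative assumptions; stop-word removal and morphological normalization are omitted.

```python
import math
import re
from collections import Counter


def tokens(sentence):
    """Lowercase word tokens; a crude stand-in for real lexical analysis."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(sentences, window=2):
    """Lexical-repetition score for every gap between adjacent sentences.

    Each score compares up to `window` sentences before the gap with up to
    `window` sentences after it (hypothetical parameter, chosen for illustration).
    """
    bags = [Counter(tokens(s)) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def boundaries(sentences, window=2, threshold=0.1):
    """Indices i where a topic shift is hypothesized just before sentence i."""
    scores = gap_scores(sentences, window)
    return [i + 1 for i, score in enumerate(scores) if score < threshold]


example = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stocks fell sharply today.",
    "Stocks and bonds fell again.",
]
print(boundaries(example))
```

In this toy input the first two sentences share "the" and "cat" and the last two share "stocks" and "fell", while nothing is repeated across the middle gap, so that gap is the only one scoring below the threshold. A summarizer could then draw sentences from each detected segment so that no topic is left unrepresented.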
