Discourse segmentation in aid of document summarization

This paper describes work to enhance a sentence-based summarizer with notions of salience, dynamically adjustable summary size, discourse segmentation, and awareness of topic shifts. Our experiments study strategies to diversify the application of a baseline summarizer by making it aware of finer-grained 'aboutness', capable of discerning changes of topic, and sensitive to longer-than-usual documents. Evaluated against the corpus used in the development of the baseline summarizer, summaries derived either by segmentation analysis alone or by a mix of strategies combining salience calculation and topic shift detection are shown to be of comparable, and under certain conditions even better, quality. We describe the summarization and segmentation procedures, outline a number of strategies for mixing the two, evaluate the overall impact of discourse segmentation, and suggest an interface design that uses the notion of topic shifts to contextualize a summary and to mediate between it and the full source document.
