A Noisy-Channel Model for Document Compression

We present a document compression system that uses a hierarchical noisy-channel model of text production. The system first automatically derives the syntactic structure of each sentence and the overall discourse structure of the input text. It then uses a statistical hierarchical model of text production to drop unimportant syntactic and discourse constituents, generating coherent, grammatical document compressions of arbitrary length. The system outperforms both a baseline and a sentence-based compression system that sequentially simplifies every sentence in a text. Our results support the claim that discourse knowledge plays an important role in document summarization.
