Paper Information - A Scalable Summarization System Using Robust NLP

A Scalable Summarization System Using Robust NLP

We describe a scalable summarization system which takes advantage of robust NLP technology such as corpus-based statistical NLP techniques, information extraction, and readily available on-line resources. The system attempts to compensate for the bottlenecks of traditional frequency-based, knowledge-based or discourse-based summarization approaches by utilizing features derived by these robust techniques. Preliminary evaluation results are reported, and the multi-dimensional summary viewer is described.

1 Introduction

Summarization research and system development can be broadly characterized as frequency-based, knowledge-based or discourse-based. These categories correspond to a continuum of increasing understanding of a text and increasing complexity in text processing.

Earliest attempts at summarization (Luhn, 1958; Edmundson, 1969; Rush, Salvador, and Zamora, 1971) essentially relied on lexical and locational information within the text, i.e., frequency of words or key terms, their proximity, and location within the text. More recent adaptations of this approach have employed an automated method to combine these types of feature sets through classification techniques (Kupiec, Pedersen, and Chen, 1995) or have drawn upon traditional information retrieval indexing methods to incorporate knowledge of a text corpus (Brandow, Mitze, and Rau, 1995). To a large extent, these types of shallow approaches are ignorant of domain knowledge and the text macrostructure. They create summaries by extracting sentences from the original document.

Knowledge-based approaches generally depend on rich domain knowledge sources to interpret the conceptual structure of the text. Systems like TOPIC (Reimer and Hahn, 1988), SUSY (Fum, Guida, and Tasso, 1985) or SCISORS (Rau, Jacobs, and Zernik, 1989) parse domain-specific texts and create conceptual representations for the generation of text summaries. These types of knowledge-based systems apply knowledge of the domain to characterize specific conceptual knowledge of a text. Paice (Paice and Jones, 1993) provides a good example of the role of this conceptual information, and Riloff (Riloff, 1995) gives a method for automatically identifying relevant concepts highly correlated with a category of interest. Because these systems create a rich conceptual representation, there are multiple ways in which a text summary may be created. For example, SUMMONS (McKeown and Radev, 1995) generates a text summary from such a template representation, while (Maybury, 1995) describes multiple methods for selecting events and presenting event summaries. Knowledge-based approaches are usually very knowledge-intensive and domain-specific.

Discourse-based approaches are grounded in theories of text cohesion and coherence and vary considerably in how much they push the limits of text understanding and the complexity as well as automation of that processing. Spearheaded by the lack of cohesion and coherence in extracts produced by frequency-based approaches, much of the work typifying discourse-based approaches focuses on linguistic processing of the text to identify the best cohesive sentence candidates (Paice, 1990; Johnson et al., 1993) or the best sentence candidates for representing the rhetorical structure of the text (Miike et al., 1994). Both approaches involve parsing the text and analyzing discourse relations to select sentences for extraction.

Frequency-based approaches (Brandow, Mitze, and Rau, 1995) may incorporate heuristics to handle readability-related issues, and knowledge-based approaches systematically perform discourse processing in analyzing and condensing the text, but in a broad classification schema it is the discourse-based approaches that tend to focus on the text macrostructure and surface clues to that structure. At the far end of the continuum lies work by Sparck Jones (Jones, 1993; Jones, 1995) in describing a

[1] James E. Rush, et al. Automatic abstracting and indexing. II. Production of indicative abstracts by application of contextual inference and syntactic coherence criteria, 1971.

[2] Michael E. Lesk, et al. Computer Evaluation of Indexing and Text Processing, 1968, JACM.