Building better corpora for summarisation

Annotated corpora have proved essential in many areas of NLP and have been successfully exploited for a wide range of tasks in Computational Linguistics, including part-of-speech tagging, parsing and information extraction. One field in which they have been particularly useful is automatic summarisation. Within this field, annotated corpora are mainly used for machine learning, to learn patterns for extracting important (and other) information from texts, and for the more complex task of evaluating summarisation methods (Edmundson, 1969; Kupiec, Pedersen and Chen, 1995; Zechner, 1996; Marcu, 1997). One accurate way to annotate a corpus is to employ humans to indicate those parts of a text that are to be annotated with whatever information is necessary. These human-selected units of text can then be used as a gold standard against which to measure the performance of a system, and to discern which types of units humans choose or discard during the summarisation process. There are semi-automatic (Orasan, 2002) and automatic (Jing and McKeown, 1999; Marcu, 1999) ways to annotate corpora, but given that we are investigating new types of information to be marked, manual annotation is most appropriate here. Although they are vital to the field, corpora annotated for summarisation are relatively scarce, and those resources which do exist do not contain as much information as they could.
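
As an illustration of how human-selected units can serve as a gold standard, the following minimal sketch scores a system's extract against an annotator's selection using precision, recall and F1. It assumes that units are sentences identified by integer index; the function name and the example data are hypothetical and are not taken from the corpus described here.

```python
def extraction_scores(gold, system):
    """Compare system-selected sentence indices with human-selected ones.

    gold   -- indices of sentences a human annotator marked as important
    system -- indices of sentences the summariser extracted
    Returns (precision, recall, f1).
    """
    gold, system = set(gold), set(system)
    overlap = len(gold & system)  # sentences chosen by both human and system

    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall; it is 0 with no overlap.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical example: the annotator marked four sentences, the system chose three.
gold_selection = [0, 3, 5, 8]
system_selection = [0, 2, 5]
precision, recall, f1 = extraction_scores(gold_selection, system_selection)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching on sentence indices is only one possible choice; richer annotation (for example, marking parts of sentences) would call for a different unit of comparison.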

[1] Kathleen McKeown, et al. The decomposition of human-written summary sentences, 1999, SIGIR '99.

[2] Daniel Marcu, et al. The automatic construction of large-scale corpora for summarization research, 1999, SIGIR '99.

[3] Helen R. Tibbo. The art of abstracting, 1997.

[4] H. P. Edmundson, et al. New Methods in Automatic Extracting, 1969, JACM.

[5] Constantin Orasan, et al. Building annotated resources for automatic text summarisation, 2002, LREC.

[6] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[7] Klaus Zechner, et al. Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences, 1996, COLING.

[8] Francine Chen, et al. A trainable document summarizer, 1995, SIGIR '95.

[9] Daniel Marcu, et al. From discourse structures to text summaries, 1997.

[10] Mark Stevenson, et al. The Reuters Corpus Volume 1 - from Yesterday’s News to Tomorrow’s Language Resources, 2002, LREC.

[11] Simone Teufel, et al. Sentence extraction as a classification task, 1997.

[12] R. Mitkov, et al. Coreference and anaphora: developing annotating tools, annotated resources and annotation strategies, 2000.

[13] S. Siegel, et al. Nonparametric Statistics for the Behavioral Sciences, 1957.

[14] Eduard Hovy, et al. Manual and automatic evaluation of summaries, 2002, ACL 2002.

[15] Jean-Pierre Desclés, et al. Knowledge-Based Automatic Abstracting: Experiments in the Sublanguage of Elementary Geometry, 1994.