The automatic construction of large-scale corpora for summarization research

Summarization research is notorious for its lack of adequate corpora: today, there exist only a few small collections of texts whose units have been manually annotated for textual importance. Given the cost and tediousness of the annotation process, it is very unlikely that sufficiently large corpora of texts will ever be manually annotated for textual importance. To circumvent this problem, we have developed an algorithm that constructs such corpora automatically. Our algorithm takes as input an ⟨Abstract, Text⟩ tuple and generates the corresponding Extract, i.e., the set of clauses (sentences) in the Text that were used to write the Abstract. An empirical experiment shows that the performance of the algorithm is close to that of humans. The experiment also suggests extraction strategies that could improve the performance of automatic summarization systems.
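
To make the input/output contract concrete, the following is a minimal sketch of one way an Extract could be derived from an ⟨Abstract, Text⟩ pair. It is not taken from the paper: it assumes a bag-of-words cosine similarity and a greedy strategy that repeatedly drops the sentence whose removal most increases the similarity between the remaining sentences and the Abstract. The function names (`bag_of_words`, `cosine`, `build_extract`) and the toy data are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for real preprocessing (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_extract(abstract, sentences):
    """Greedily delete sentences while similarity to the abstract keeps rising."""
    target = bag_of_words(abstract)
    bags = [bag_of_words(s) for s in sentences]
    kept = list(range(len(sentences)))

    def score(indices):
        # Similarity between the pooled words of the kept sentences and the abstract.
        total = Counter()
        for i in indices:
            total.update(bags[i])
        return cosine(total, target)

    best = score(kept)
    while len(kept) > 1:
        # Try removing each remaining sentence and keep the best-scoring removal.
        candidates = [(score([i for i in kept if i != j]), j) for j in kept]
        top_score, drop = max(candidates)
        if top_score <= best:
            break  # no single deletion improves the match any further
        best = top_score
        kept.remove(drop)
    return [sentences[i] for i in kept]


if __name__ == "__main__":
    text = [
        "The cat sat on the mat.",
        "Stock prices fell sharply yesterday.",
        "Rain is expected tomorrow.",
    ]
    print(build_extract("The cat sat on the mat.", text))
```

On this toy input the sketch keeps only the first sentence, since deleting the other two raises the similarity to the Abstract at each step. A realistic system would operate on clauses and use stronger matching than raw word overlap.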
