The diversity-based approach to open-domain text summarization

The paper introduces a novel approach to unsupervised text summarization, which in principle should work for any domain or genre. The novelty lies in exploiting the diversity of concepts in text for summarization, which has not received much attention in the summarization literature. We propose, in addition, what we call the information-centric approach to evaluation, where the quality of summaries is judged not in terms of how well they match human-created summaries but in terms of how well they represent their source documents in IR tasks such document retrieval and text categorization. To find the effectiveness of our approach under the proposed evaluation scheme, we set out to examine how a system with the diversity functionality performs against one without, using the test data known as BMIR-J2. The results demonstrate a clear superiority of the diversity-based approach to a non-diversity-based approach.The paper also addresses the question of how closely the diversity approach models human judgments on summarization. We have created a relatively large volume of data annotated for relevance to summarization by human subjects. We have trained a decision tree-based summarizer using the data, and examined how the diversity method compares with the supervised method in performance when tested on the data. It was found that the diversity approach performs as well as and in some cases superior to the supervised method.

[1]  H. P. Edmundson,et al.  New Methods in Automatic Extracting , 1969, JACM.

[2]  Gerard Salton,et al.  Automatic Text Structuring and Summarization , 1997, Inf. Process. Manag..

[3]  Daniel Marcu The rhetorical parsing of natural language texts , 1997 .

[4]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[5]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[6]  Francine Chen,et al.  A trainable document summarizer , 1995, SIGIR '95.

[7]  Daniel Marcu,et al.  The automatic construction of large-scale corpora for summarization research , 1999, SIGIR '99.

[8]  David G. Stork,et al.  Pattern Classification , 1973 .

[9]  Dragomir R. Radev,et al.  Centroid-based summarization of multiple documents , 2004, Inf. Process. Manag..

[10]  Yuji Matsumoto,et al.  Data Reliability and Its Effects on Automatic Abstracting , 1997, VLC.

[11]  Daniel Marcu,et al.  Discourse Trees Are Good Indicators of Importance in Text , 1999 .

[12]  Yuji Matsumoto,et al.  Japanese Morphological Analysis System ChaSen version 2.0 Manual , 1999 .

[13]  Klaus Zechner,et al.  Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences , 1996, COLING.

[14]  Yuji Matsumoto,et al.  Comparing the Minimum Description Length Principle and Boosting in the Automatic Analysis of Discourse , 2000, ICML.

[15]  Mark T. Maybury,et al.  Advances in Automatic Text Summarization , 1999 .

[16]  Kathleen R. McKeown,et al.  Summarization Evaluation Methods: Experiments and Analysis , 1998 .

[17]  S. Siegel,et al.  Nonparametric Statistics for the Behavioral Sciences , 2022, The SAGE Encyclopedia of Research Design.

[18]  J. Stochastic Complexity in Learning , .

[19]  Jade Goldstein-Stewart,et al.  The use of MMR, diversity-based reranking for reordering documents and producing summaries , 1998, SIGIR '98.

[20]  Andrew W. Moore,et al.  X-means: Extending K-means with Efficient Estimation of the Number of Clusters , 2000, ICML.

[21]  Slava M. Katz Distribution of content words and phrases in text and language modelling , 1996, Natural Language Engineering.

[22]  Inderjeet Mani,et al.  The Tipster Summac Text Summarization Evaluation , 1999, EACL.

[23]  Hang Li,et al.  A Probabilistic Approach to Lexical Semantic Knowledge Acquisition and Structural Disambiguation , 1998, ArXiv.

[24]  Paul S. Bradley,et al.  Refining Initial Points for K-Means Clustering , 1998, ICML.

[25]  Jihoon Yang,et al.  Text Summarization by Sentence Segment Extraction Using Machine Learning Algorithms , 2000, PAKDD.

[26]  Vibhu O. Mittal,et al.  Query-Relevant Summarization using FAQs , 2000, ACL.

[27]  George M. Kasper,et al.  The Effects and Limitations of Automated Text Condensing on Reading Comprehension Performance , 1992, Inf. Syst. Res..

[28]  Xin Liu,et al.  Generic text summarization using relevance measure and latent semantic analysis , 2001, SIGIR '01.

[29]  Karen Sparck Jones,et al.  Book Reviews: Evaluating Natural Language Processing Systems: An Analysis and Review , 1996, CL.

[30]  Yuji Matsumoto,et al.  A new approach to unsupervised text summarization , 2001, SIGIR '01.