Paragraph-, Word-, and Coherence-based Approaches to Sentence Ranking: A Comparison of Algorithm and Human Performance

Sentence ranking is a crucial part of generating text summaries. We compared human sentence rankings obtained in a psycholinguistic experiment to three types of approaches to sentence ranking: a simple paragraph-based approach intended as a baseline, two word-based approaches, and two coherence-based approaches. In the paragraph-based approach, sentences at the beginning of paragraphs received higher importance ratings than other sentences. The word-based approaches determined sentence rankings based on relative word frequencies (Luhn, 1958; Salton & Buckley, 1988). The coherence-based approaches determined sentence rankings based on some property of the coherence structure of a text (Marcu, 2000; Page et al., 1998). Our results suggest poor performance for the simple paragraph-based approach, whereas the word-based approaches perform remarkably well. The best performance was achieved by a coherence-based approach in which coherence structures are represented as non-tree structures. Most approaches also outperformed the commercially available MSWord summarizer.
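To make the first two families of approaches concrete, the following is a minimal Python sketch, not the implementation evaluated in the paper. It assumes a naive regex sentence splitter, a binary score for the paragraph-based baseline, and an average relative word frequency as a simplified stand-in for the word-based methods of Luhn (1958) and Salton & Buckley (1988), whose actual weighting schemes differ. The function names `split_sentences`, `paragraph_scores`, and `word_frequency_scores` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(paragraph):
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def paragraph_scores(paragraphs):
    # Paragraph-based baseline (simplified): the first sentence of each
    # paragraph scores 1.0, every other sentence scores 0.0.
    scores = []
    for paragraph in paragraphs:
        for i, sentence in enumerate(split_sentences(paragraph)):
            scores.append((sentence, 1.0 if i == 0 else 0.0))
    return scores


def word_frequency_scores(paragraphs):
    # Word-based stand-in: score each sentence by the mean relative
    # frequency of its words within the whole document.
    sentences = [s for p in paragraphs for s in split_sentences(p)]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(word for tokens in tokenized for word in tokens)
    total = sum(freq.values())
    scores = []
    for sentence, tokens in zip(sentences, tokenized):
        if tokens:
            score = sum(freq[w] / total for w in tokens) / len(tokens)
        else:
            score = 0.0
        scores.append((sentence, score))
    return scores


if __name__ == "__main__":
    doc = ["Cats chase mice. Dogs bark.", "Cats sleep. Birds sing loudly."]
    for sentence, score in word_frequency_scores(doc):
        print(f"{score:.3f}  {sentence}")
```

Sorting either list by score in descending order yields a sentence ranking that could then be compared against human rankings, for example with a rank correlation coefficient.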

[1]  Chris Buckley, et al.  Automatic Text Summarization by Paragraph Extraction, 1997.

[2]  Jean Carletta, et al.  Assessing Agreement on Classification Tasks: The Kappa Statistic, 1996, CL.

[3]  Kathleen R. McKeown, et al.  Summarization Evaluation Methods: Experiments and Analysis, 1998.

[4]  Julia Hirschberg, et al.  A Prosodic Analysis of Discourse Segments in Direction-Giving Monologues, 1996, ACL.

[5]  Hinrich Schütze, et al.  Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[6]  Jade Goldstein-Stewart, et al.  Summarizing text documents: sentence selection and evaluation metrics, 1999, SIGIR '99.

[7]  Alex Lascarides, et al.  Temporal interpretation, discourse relations and commonsense entailment, 1993, The Language of Time - A Reader.

[8]  William C. Mann, et al.  Rhetorical Structure Theory: Toward a functional theory of text organization, 1988.

[9]  Karen Spärck Jones.  What Might be in a Summary?, 1993, Information Retrieval.

[10]  Hans Peter Luhn, et al.  The Automatic Creation of Literature Abstracts, 1958, IBM J. Res. Dev.

[11]  S. Corston-Oliver, et al.  Computing representations of the structure of written discourse, 1998.

[12]  Rajeev Motwani, et al.  The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.

[13]  Dragomir R. Radev, et al.  Introduction to the Special Issue on Summarization, 2002, CL.

[14]  J. Hobbs.  On the coherence and structure of discourse, 1985.

[15]  Lisa F. Rau, et al.  Automatic Condensation of Electronic Publications by Sentence Selection, 1995, Inf. Process. Manag.

[16]  G. Meade.  Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory, 2001.

[17]  Andreas Paepcke, et al.  Seeing the whole in parts: text summarization for web browsing on handheld devices, 2001, WWW '01.

[18]  Karen Spärck Jones, et al.  Generic summaries for indexing in information retrieval, 2001, SIGIR '01.

[19]  Xin Liu, et al.  Generic text summarization using relevance measure and latent semantic analysis, 2001, SIGIR '01.

[20]  Chris H. Q. Ding, et al.  PageRank, HITS and a unified framework for link analysis, 2002, SIGIR '02.

[21]  Seiji Miike, et al.  Abstract Generation Based on Rhetorical Structure Extraction, 1994, COLING.

[22]  Klaus Zechner, et al.  Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences, 1996, COLING.

[23]  D. Horn.  A correction for the effect of tied ranks on the value of the rank difference correlation coefficient, 1942.

[24]  Daniel Marcu, et al.  Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory, 2001, SIGDIAL Workshop.

[25]  Candace L. Sidner, et al.  Attention, Intentions, and the Structure of Discourse, 1986, CL.

[26]  Gerard Salton, et al.  Term-Weighting Approaches in Automatic Text Retrieval, 1988, Inf. Process. Manag.