A Survey of Text Summarization Techniques

Numerous approaches for identifying important content for automatic text summarization have been developed to date. Topic representation approaches first derive an intermediate representation of the text that captures the topics discussed in the input. Based on these topic representations, sentences in the input document are scored for importance. In contrast, indicator representation approaches represent the text by a diverse set of possible indicators of importance that do not aim at discovering topicality. These indicators are combined, very often using machine learning techniques, to score the importance of each sentence. Finally, a summary is produced either by greedy selection, choosing the sentences that will go in the summary one by one, or by global optimization, choosing the best set of sentences to form a summary. In this chapter we give a broad overview of existing approaches based on these distinctions, with particular attention to how representation, sentence scoring, and summary selection strategies affect the overall performance of the summarizer. We also point out some of the peculiarities of the summarization task that have posed challenges to machine learning approaches to the problem, and some of the suggested solutions.
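The three stages described above (representation, sentence scoring, and selection) can be illustrated with a minimal sketch. It assumes the simplest possible topic representation, unigram word probabilities over the input, scores each sentence by the average probability of its words, and selects greedily. After each pick, the probabilities of the words just covered are squared, a SumBasic-style discount chosen here to discourage redundancy. All function names and the regex-based sentence splitter and tokenizer are hypothetical and purely illustrative; this is not the method of any particular system surveyed in the chapter.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphabetic tokens only; no stop-word removal or stemming.
    return re.findall(r"[a-z]+", sentence.lower())


def word_probabilities(sentences):
    # Topic representation: unigram probability of each word in the input.
    counts = Counter(w for s in sentences for w in tokenize(s))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def score(sentence, probs):
    # Sentence importance: average probability of the words it contains.
    words = tokenize(sentence)
    if not words:
        return 0.0
    return sum(probs.get(w, 0.0) for w in words) / len(words)


def greedy_summary(text, max_sentences=2):
    sentences = split_sentences(text)
    probs = word_probabilities(sentences)
    remaining = list(range(len(sentences)))
    chosen = []
    while remaining and len(chosen) < max_sentences:
        # Greedy step: take the best-scoring sentence under the current probabilities.
        best = max(remaining, key=lambda i: score(sentences[i], probs))
        chosen.append(best)
        remaining.remove(best)
        # Discount words that are now covered so later picks favor new content.
        for w in set(tokenize(sentences[best])):
            probs[w] = probs[w] ** 2
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    doc = "Cats chase mice. Cats chase birds. Dogs sleep."
    print(greedy_summary(doc, max_sentences=2))
```

A global selection strategy would instead search over whole sets of sentences under a length budget rather than committing to one sentence at a time; the greedy loop above is only the simpler of the two alternatives.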
