Summarizing Spoken Documents: avoiding distracting content

Driven by a cognitive perspective on the human summarization process, we address the problem of identifying the most relevant information in a single spoken language document by minimizing the influence of distracting content, of which passages particularly affected by spoken language-related problems are the main examples. We consider two approaches. The first, based only on the input source to be summarized, is a centrality-based relevance model for automatic summarization that uses support sets to better estimate the relevant content. Geometric proximity is used to compute semantic relatedness. Relevance is determined by considering the whole input source and by assuming that an input source to be summarized covers several topics. A thorough evaluation shows statistically significant improvements over previous approaches. The second mimics natural human behavior, in which information acquired from different sources is used to build a better understanding of a given topic. We explore information both from sources of different types and from sources of the same type. A multi-document summarization framework provides the means to assess the relevant content. A perceptual evaluation shows that mixing information leads to considerably better results, in terms of both informativeness and readability. Concerning the use of information of the same type, results show that background information on the same topic clearly improves the detection of the most important content.
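The first approach can be illustrated with a minimal sketch. The abstract does not specify how support sets are built or which distance is used, so the choices below are assumptions made only for illustration: bag-of-words vectors, Euclidean distance as the geometric proximity, a support set defined as the k nearest passages, and relevance counted as the number of support sets a passage appears in. All function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(passage):
    # Assumption: whitespace tokenization and raw term counts.
    return Counter(passage.lower().split())


def euclidean(a, b):
    # Geometric proximity between two sparse term-count vectors.
    # Counter returns 0 for terms that are missing from a vector.
    keys = set(a) | set(b)
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in keys))


def support_sets(vectors, k):
    # Assumption: the support set of a passage is its k nearest passages.
    sets = []
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: euclidean(v, vectors[j]))
        sets.append(set(others[:k]))
    return sets


def rank_passages(passages, k=2):
    # A passage is treated as more relevant the more support sets contain it.
    vectors = [bag_of_words(p) for p in passages]
    votes = Counter(j for s in support_sets(vectors, k) for j in s)
    # Sort by descending vote count; ties are broken by original position.
    return sorted(range(len(passages)), key=lambda j: (-votes[j], j))
```

Under these assumptions, a passage made up mostly of fillers or recognition errors shares few terms with the rest of the document, so it rarely enters another passage's support set and falls to the bottom of the ranking. This is one way a centrality-based model can down-weight distracting content without any external resources.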
