Exploiting relevance, coverage, and novelty for query-focused multi-document summarization

Summarization plays an increasingly important role given the exponential growth of documents on the Web. Query-focused summarization in particular poses three challenges: (1) how to retrieve query-relevant sentences; (2) how to concisely cover the main aspects (i.e., topics) of the documents; and (3) how to balance these two requirements. Regarding relevance, many traditional summarization techniques assume that sentences are relevant independently of one another, which may not hold in practice. In this paper, we go beyond this assumption and propose a novel Probabilistic-modeling Relevance, Coverage, and Novelty (PRCN) framework, which exploits a reference topic model incorporating the user query to measure dependent relevance. Topic coverage is modeled within the same framework. To further address the challenges above, various sentence features regarding relevance and novelty are constructed, while moderate topic coverage is maintained through a greedy algorithm for topic balance. Finally, experiments on the DUC2005 and DUC2006 datasets validate the effectiveness of the proposed method.
