Query-oriented text summarization based on hypergraph transversals

Abstract The growing volume of text available on the Internet has created a need for automatic document summarization tools. The main challenges of query-oriented extractive summarization are (1) to identify the topics of the documents and (2) to recover query-relevant sentences that together cover these topics. Existing graph- or hypergraph-based summarizers use graph-based ranking algorithms to score each sentence's relevance individually. As a result, these systems do not measure the topics jointly covered by the sentences forming the summary, which tends to produce redundant summaries. To select non-redundant sentences that jointly cover the main query-relevant topics of a corpus, we propose a new method based on the theory of hypergraph transversals. First, we introduce a new topic model based on the semantic clustering of terms in order to discover the topics present in a corpus. Second, these topics are modeled as the hyperedges of a hypergraph whose nodes are the sentences. A summary is then produced by generating a transversal of the hypergraph, that is, a set of nodes that intersects every hyperedge. Algorithms based on the theory of submodular functions are proposed to generate the transversals and to build the summaries. The proposed summarizer outperforms existing graph- or hypergraph-based summarizers by at least 6% in ROUGE-SU4 F-measure on the DUC 2007 dataset. It also has a lower computational time complexity than existing hypergraph-based summarizers.
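To make the notion of a transversal concrete, the following is a minimal sketch of a generic greedy hitting-set heuristic, not the algorithm proposed in the paper. It assumes sentences are identified by integer ids and that each topic (hyperedge) is given as the set of sentence ids it contains; the function name `greedy_transversal`, the `weights` parameter, and the toy topics are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Optional, Set


def greedy_transversal(
    hyperedges: Dict[str, Set[int]],
    weights: Optional[Dict[str, float]] = None,
) -> List[int]:
    """Greedily pick nodes until every non-empty hyperedge contains a picked node.

    hyperedges maps a topic label to the set of sentence ids in that topic.
    weights (hypothetical, optional) gives a strictly positive weight per topic.
    """
    if weights is None:
        weights = {topic: 1.0 for topic in hyperedges}
    if any(weights[topic] <= 0 for topic in hyperedges):
        raise ValueError("topic weights must be strictly positive")

    # Hyperedges not yet hit by a selected node; empty ones can never be hit.
    uncovered = {topic for topic, nodes in hyperedges.items() if nodes}
    # Sorted so that ties are broken in favour of the lowest sentence id.
    candidates = sorted(set().union(*hyperedges.values()))
    selected: List[int] = []

    while uncovered:
        def gain(node: int) -> float:
            # Total weight of the still-uncovered topics this sentence would hit.
            return sum(weights[t] for t in uncovered if node in hyperedges[t])

        best = max(candidates, key=gain)
        selected.append(best)
        uncovered = {t for t in uncovered if best not in hyperedges[t]}

    return selected


# Toy example: four topics over five sentences (ids 0..4).
topics = {
    "economy": {0, 1},
    "policy": {1, 2},
    "health": {1, 3},
    "trade": {3, 4},
}
summary_ids = greedy_transversal(topics)
print(summary_ids)
```

In this toy example, sentence 1 is picked first because it hits three topics at once, and sentence 3 is then picked to hit the remaining "trade" topic. The coverage gain used here is a monotone submodular function of the selected set, which is why greedy selection is a natural baseline; a practical summarizer would additionally account for query relevance and a summary length budget, which this sketch omits.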
