Multi-document summarization using closed patterns

Multi-document summarization methods fall into two main categories: term-based and ontology-based. Term-based methods cannot handle polysemy and synonymy. Ontology-based approaches address these problems by taking the semantic information of document content into account, but constructing an ontology requires considerable manual effort. To address these problems, this paper presents a pattern-based model for generic multi-document summarization, which exploits closed patterns to extract the most salient sentences from a document collection and to reduce redundancy in the summary. Our method calculates the weight of each sentence in a document collection by accumulating the weights of the closed patterns that cover the sentence. It then iteratively selects the sentence that has the highest weight and is least similar to the previously selected sentences, until the summary length limit is reached. Calculating sentence weights from patterns reduces dimensionality and captures more relevant information. Our method combines the advantages of the term-based and ontology-based models while avoiding their weaknesses. Empirical studies on the benchmark DUC2004 datasets demonstrate that our pattern-based method significantly outperforms state-of-the-art methods. Multi-document summarization can also be used to extract a particular individual's opinions, in the form of closed patterns, from the documents that individual shares on social networks, and hence provides a useful tool for further analyzing the individual's behavior and influence in group activities.
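The weighting and selection steps described above can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several details are assumptions made only for illustration: each sentence is treated as a set of lowercased terms, a pattern's weight is taken to be its support, similarity is Jaccard overlap, and the names `closed_patterns`, `summarize`, `min_support`, `max_words`, and `max_sim` are hypothetical. Closed patterns are found by brute force, which is practical only for tiny inputs.

```python
def closed_patterns(transactions, min_support=2):
    """Return {pattern: support} for closed term sets.

    A term set is closed if no proper superset has the same support.
    Closed sets are exactly the intersections of groups of transactions,
    so we close the transaction family under pairwise intersection.
    """
    closed = set(transactions)
    frontier = set(transactions)
    while frontier:
        new = {a & b for a in frontier for b in closed} - closed
        new.discard(frozenset())
        closed |= new
        frontier = new
    result = {}
    for pattern in closed:
        support = sum(1 for t in transactions if pattern <= t)
        if pattern and support >= min_support:
            result[pattern] = support
    return result


def jaccard(a, b):
    """Jaccard similarity between two term sets (assumed similarity measure)."""
    return len(a & b) / len(a | b)


def summarize(sentences, min_support=2, max_words=8, max_sim=0.5):
    """Greedy selection: heaviest sentences first, skipping near-duplicates."""
    transactions = [frozenset(s.lower().split()) for s in sentences]
    patterns = closed_patterns(transactions, min_support)
    # Sentence weight = sum of weights (here: supports) of covering patterns.
    weights = [
        sum(sup for p, sup in patterns.items() if p <= t) for t in transactions
    ]
    order = sorted(range(len(sentences)), key=lambda i: -weights[i])
    chosen, used = [], 0
    for i in order:
        length = len(sentences[i].split())
        if used + length > max_words:
            continue  # would exceed the length limit
        if any(jaccard(transactions[i], transactions[j]) > max_sim for j in chosen):
            continue  # too similar to an already selected sentence
        chosen.append(i)
        used += length
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    docs = [
        "storm floods coastal city",
        "storm floods city streets",
        "rescue teams reach city",
        "officials praise rescue teams",
    ]
    print(summarize(docs))
```

On this toy input the second sentence is skipped because it overlaps heavily with the first, which shows how the similarity check limits redundancy. A real system would replace the brute-force miner with a dedicated closed-pattern mining algorithm and would likely use a different pattern weighting.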
