PatTexSum: A pattern-based text summarizer

Over the last decade, the growth of the Internet has made a huge number of textual documents available in electronic form. Text summarization is commonly based on clustering or graph-based methods and usually relies on a bag-of-words sentence representation. Frequent itemset mining is a widely used exploratory technique for discovering relevant correlations among data. Its well-established application to large transactional datasets prompts its use in document summarization as well. This paper proposes a novel multi-document summarizer, PatTexSum (Pattern-based Text Summarizer), that relies mainly on a pattern-based model, i.e., a model composed of frequent itemsets. Unlike previously proposed approaches, PatTexSum selects the most representative and non-redundant sentences to include in the summary by considering both (i) the most informative and non-redundant itemsets extracted from document collections tailored to the transactional data format, and (ii) a sentence score based on tf-idf statistics. Experiments conducted on a collection of real news articles show the effectiveness of the proposed approach.
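To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that each sentence is treated as a transaction of distinct terms, that frequent itemsets are found by brute-force enumeration (a real system would use an Apriori- or FP-growth-style miner), that the sentence score is the sum of raw-count tf-idf weights of its terms, and that sentences are picked greedily to cover itemsets not yet covered. The step that filters itemsets down to the most informative and non-redundant ones is not reproduced here. All function names and parameters (`tokenize`, `frequent_itemsets`, `summarize`, `min_support`, `max_size`, `max_sentences`) are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'").lower() for w in sentence.split())
    return [w for w in words if w]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force mining: term sets of size <= max_size that occur in
    at least min_support transactions (assumption: absolute support)."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(t), k):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= min_support}


def summarize(documents, min_support=2, max_sentences=3):
    """documents: list of documents, each a list of sentence strings.
    Greedily selects sentences that cover uncovered frequent itemsets,
    breaking ties with a tf-idf sentence score (an illustrative choice)."""
    sentences = [(d, s) for d, doc in enumerate(documents) for s in doc]
    transactions = [frozenset(tokenize(s)) for _, s in sentences]
    itemsets = frequent_itemsets(transactions, min_support)

    # Term frequency per document, document frequency over the collection.
    tf = [Counter(tok for s in doc for tok in tokenize(s)) for doc in documents]
    df = Counter(term for counts in tf for term in counts)
    n_docs = len(documents)

    def score(i):
        d = sentences[i][0]
        return sum(tf[d][t] * math.log(n_docs / df[t]) for t in transactions[i])

    uncovered = set(itemsets)
    chosen = []
    while uncovered and len(chosen) < max_sentences:
        candidates = [i for i in range(len(sentences)) if i not in chosen]
        if not candidates:
            break

        def gain(i):
            # Number of still-uncovered itemsets contained in sentence i.
            return sum(1 for s in uncovered if s <= transactions[i])

        best = max(candidates, key=lambda i: (gain(i), score(i)))
        if gain(best) == 0:
            break  # no remaining sentence adds new itemsets
        chosen.append(best)
        uncovered = {s for s in uncovered if not s <= transactions[best]}

    # Return selected sentences in their original order.
    return [sentences[i][1] for i in sorted(chosen)]


if __name__ == "__main__":
    docs = [
        ["the flood hit the city", "rescue teams reached the city"],
        ["the flood damaged the old city bridge", "markets closed early"],
    ]
    print(summarize(docs, min_support=2, max_sentences=2))
```

In this toy collection, two sentences each contain every frequent itemset, so the tf-idf score decides between them. Covering itemsets rather than single terms is what discourages redundancy: once a sentence has covered a pattern, other sentences carrying the same pattern gain nothing from it.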
