Quantifying the informativeness for biomedical literature summarization: An itemset mining method

OBJECTIVE: Automatic text summarization tools can help users in the biomedical domain access information efficiently from large volumes of scientific literature and other text sources. In this paper, we propose a summarization method that combines itemset mining and domain knowledge to construct a concept-based model and extract the main subtopics of an input document. Our summarizer quantifies the informativeness of each sentence using the support values of the itemsets that appear in the sentence.

METHODS: To analyze the text at the concept level, our method first maps the original document to biomedical concepts using the Unified Medical Language System (UMLS). It then discovers the essential subtopics of the text using itemset mining, a data mining technique, and constructs the summarization model. The itemset mining algorithm extracts a set of frequent itemsets containing correlated and recurrent concepts of the input document. Finally, the summarizer selects the most related and informative sentences and generates the summary.

RESULTS: We evaluate the itemset-based summarizer in a set of experiments using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics. We compare the proposed method with GraphSum, TexLexAn, SweSum, SUMMA, AutoSummarize, the term-based version of the itemset-based summarizer, and two baselines. The itemset-based summarizer outperforms the compared methods and achieves the best scores on all assessed ROUGE metrics (R-1: 0.7583, R-2: 0.3381, R-W-1.2: 0.0934, and R-SU4: 0.3889). We also perform a set of preliminary experiments to determine the best value for the minimum support threshold used in the itemset mining algorithm. The results show that this threshold directly affects the accuracy of the summarization model: assigning an extreme threshold leads to a significant decrease in summarization performance.

CONCLUSION: In comparison with statistical, similarity-based, and word-frequency methods, the proposed method demonstrates that a summarization model built from concept extraction and itemset mining provides the summarizer with an effective metric for measuring the informative content of sentences. This can improve the performance of biomedical literature summarization.
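The scoring idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each sentence has already been mapped to a set of concept labels (plain strings here stand in for UMLS concepts), it mines frequent itemsets by brute-force enumeration rather than a dedicated algorithm, and it assumes a sentence's score is the sum of the supports of the frequent itemsets it contains. The function names, the `max_size` cap, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for itemsets with support >= min_support.

    Support is the fraction of sentences containing every concept in the
    itemset. Brute-force enumeration: fine for toy input only.
    """
    n = len(transactions)
    counts = Counter()
    for concepts in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(concepts), size):
                counts[frozenset(itemset)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def score_sentences(transactions, itemsets):
    """Score each sentence by summing supports of the itemsets it contains
    (an assumed aggregation; the abstract does not specify the formula)."""
    return [
        sum(support for itemset, support in itemsets.items() if itemset <= concepts)
        for concepts in transactions
    ]


def summarize(transactions, min_support, k):
    """Return indices of the k highest-scoring sentences in document order."""
    itemsets = mine_frequent_itemsets(transactions, min_support)
    scores = score_sentences(transactions, itemsets)
    ranked = sorted(range(len(transactions)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


# Hypothetical document: one set of concept labels per sentence.
sentences = [
    {"autism", "gene"},
    {"autism", "gene", "schizophrenia"},
    {"treatment"},
    {"gene", "schizophrenia"},
]
print(summarize(sentences, min_support=0.5, k=2))
```

In this toy setting, raising `min_support` close to 1.0 leaves no frequent itemsets and every sentence scores zero, while lowering it toward 0 admits every concept combination; this gives a rough intuition for why extreme thresholds could hurt the summarization model.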
