A New Biomedical Text Summarization Method Based on Sentence Clustering and Frequent Itemsets Mining

In this paper, we combine sentence clustering and frequent itemset mining to build a single biomedical text summarization method. Biomedical documents are represented as sets of UMLS concepts, and very generic concepts are discarded. Sentences are represented with the vector space model, and the K-means algorithm is applied to cluster semantically similar sentences. The most frequent itemsets are extracted from the global cluster and used to calculate a score for each sentence. The top N highest-scoring sentences are selected to form the final summary. The method is evaluated against three summarizers (TextRank, SweSum, and the Itemset-based summarizer) on 50 randomly selected full-text biomedical papers from the BioMed Central database. The evaluation compares the generated summaries with the abstracts of these papers using the ROUGE toolkit. Our method ranked first on the ROUGE-1 and ROUGE-2 measures, with an improvement of \(\sim \)3% over the Itemset-based summarizer, and second on the ROUGE-SU4 measure, with a decrease of \(\sim \)1% relative to the same Itemset-based summarizer.
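The itemset-based scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the concept labels, the brute-force itemset enumeration, the `min_support` threshold, and the scoring rule (sum of the supports of the frequent itemsets a sentence contains) are all assumptions chosen for illustration, and the K-means clustering step is omitted.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size concepts.

    Brute-force enumeration; an Apriori-style miner would prune candidates.
    """
    n = len(transactions)
    counts = {}
    for concepts in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(concepts), size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {k: c / n for k, c in counts.items() if c / n >= min_support}

def score_sentences(transactions, itemsets):
    """Score each sentence by the summed support of the itemsets it covers."""
    return [
        sum(support for items, support in itemsets.items() if items <= concepts)
        for concepts in transactions
    ]

def summarize(sentences, transactions, n, min_support=0.5):
    """Pick the n highest-scoring sentences, returned in document order."""
    itemsets = frequent_itemsets(transactions, min_support)
    scores = score_sentences(transactions, itemsets)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]

# Hypothetical input: each sentence is already mapped to a set of concepts
# (in the paper these would be UMLS concepts; the labels here are made up).
sentences = [
    "Inhalers reduce asthma symptoms in children.",
    "Asthma is commonly treated with an inhaler.",
    "The weather was mild during the study.",
    "Asthma prevalence in children is rising.",
]
transactions = [
    {"asthma", "inhaler", "children"},
    {"asthma", "inhaler"},
    {"weather"},
    {"asthma", "children"},
]

summary = summarize(sentences, transactions, n=1)
print(summary)
```

In this toy input the first sentence covers every frequent itemset, so it is selected when `n=1`, while the off-topic third sentence receives a score of zero.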
