Word Embedding-Based Biomedical Text Summarization

In this paper, we propose a novel word embedding-based biomedical text summarizer. Biomedical words are represented by dense real-valued vectors, and each sentence is represented by the sum of the vectors of the words it contains. The PageRank algorithm is then applied to rank sentences, using cosine similarity as the similarity measure between sentence vectors. The N highest-ranked sentences are selected to build the summary. For the evaluation, we created a corpus of 200 biomedical papers downloaded from the BioMed Central full-text database. We used a pre-trained Word2vec model whose word vectors were generated from a combination of PubMed, PMC, and a recent English Wikipedia dump. We compared our method with four other summarizers using the ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-SU4 metrics, evaluating the generated summaries against the abstracts of the papers. Our summarizer achieved improvements of 3.48%, 7.68%, 9.76%, and 3.47% on these metrics, respectively, over the second-ranked summarizer.
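The pipeline described above (summed word vectors, a cosine-similarity sentence graph, PageRank, top-N selection) can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional toy embeddings stand in for the pre-trained Word2vec model, and the whitespace tokenizer, the damping factor of 0.85, the clipping of negative similarities to zero, and all function names are assumptions made for illustration.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim):
    """Sum the vectors of the known words in a sentence (unknown words are skipped)."""
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in embeddings:
            vec += embeddings[tok]
    return vec

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between sentence vectors, with a zero diagonal."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0           # avoid division by zero for empty sentences
    unit = vectors / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)          # no self-loops in the sentence graph
    return np.clip(sim, 0.0, None)      # assumption: negative similarities become 0

def pagerank(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Weighted PageRank by power iteration on a symmetric similarity graph."""
    n = weights.shape[0]
    col_sums = weights.sum(axis=0)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    # Column-stochastic transition matrix; isolated sentences spread uniformly.
    transition = np.where(col_sums > 0, weights / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1.0 - damping) / n + damping * (transition @ scores)
        converged = np.abs(updated - scores).sum() < tol
        scores = updated
        if converged:
            break
    return scores

def summarize(sentences, embeddings, dim, top_n=2):
    """Return the top_n ranked sentences in document order, plus all scores."""
    tokenized = [s.lower().split() for s in sentences]   # assumption: naive tokenizer
    vectors = np.array([sentence_vector(t, embeddings, dim) for t in tokenized])
    scores = pagerank(cosine_similarity_matrix(vectors))
    top = sorted(np.argsort(-scores)[:top_n])
    return [sentences[i] for i in top], scores

# Hypothetical toy embeddings; a real run would load pre-trained Word2vec vectors.
toy_embeddings = {
    "gene": np.array([1.0, 0.0, 0.0]),
    "protein": np.array([0.9, 0.1, 0.0]),
    "expression": np.array([0.8, 0.2, 0.0]),
    "mutation": np.array([0.7, 0.3, 0.0]),
    "weather": np.array([0.0, 0.0, 1.0]),
    "sunny": np.array([0.0, 0.1, 0.9]),
    "today": np.array([0.0, 0.0, 0.8]),
}
document = [
    "gene protein expression",
    "protein gene mutation",
    "weather sunny today",
]
summary, scores = summarize(document, toy_embeddings, dim=3, top_n=2)
print(summary)
```

In this toy document the two sentences about genes and proteins point in nearly the same direction, so they reinforce each other in the graph and outrank the unrelated third sentence.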
