Comparing different knowledge sources for the automatic summarization of biomedical literature

OBJECTIVE Automatic summarization of biomedical literature usually relies on domain knowledge from external sources to build rich semantic representations of the documents to be summarized. In this paper, we investigate how the choice of knowledge source affects the quality of the generated summaries.

MATERIALS AND METHODS We present a method for representing a set of documents relevant to a given biological entity or topic as a semantic graph of domain concepts and relations. Different graphs are created by using different combinations of ontologies and vocabularies within the UMLS (including GO, SNOMED-CT, HUGO and all available vocabularies in the UMLS) to retrieve domain concepts, and by using different types of relationships (co-occurrence and semantic relations from the UMLS Metathesaurus and Semantic Network) to link the concepts in the graph. The resulting graphs are then used as input to a summarization system that produces summaries composed of the most relevant sentences from the original documents.

RESULTS AND CONCLUSIONS Our experiments demonstrate that the choice of the knowledge source used to model the text has a significant impact on the quality of the automatic summaries. In particular, we find that, when summarizing gene-related literature, using GO, SNOMED-CT and HUGO to extract domain concepts results in significantly better summaries than using all available vocabularies in the UMLS. This finding suggests that successful biomedical summarization requires selecting an appropriate knowledge source, whose coverage, specificity and relations must match the type of documents to be summarized.
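The general idea of a concept graph feeding an extractive summarizer can be illustrated with a minimal sketch. It is not the system described in this paper: it assumes that concepts have already been extracted for each sentence (the concept labels below are hand-written stand-ins, not output from the UMLS or any mapping tool), links concepts only by sentence-level co-occurrence, and scores sentences by the summed degree of their concepts, which is one simple salience measure chosen here for illustration. All function names are hypothetical.

```python
from itertools import combinations


def build_cooccurrence_graph(sentence_concepts):
    """Link every pair of distinct concepts that share a sentence."""
    graph = {}
    for concepts in sentence_concepts:
        for a, b in combinations(sorted(set(concepts)), 2):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def rank_sentences(sentence_concepts, graph):
    """Order sentence indices by the summed degree of their concepts.

    Degree centrality is an assumption of this sketch, not the
    ranking used in the paper.
    """
    scored = []
    for idx, concepts in enumerate(sentence_concepts):
        score = sum(len(graph.get(c, ())) for c in set(concepts))
        scored.append((score, idx))
    # Highest score first; ties are broken by earlier position.
    return [idx for _, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]


def summarize(sentences, sentence_concepts, n=1):
    """Return the n top-ranked sentences in their original order."""
    graph = build_cooccurrence_graph(sentence_concepts)
    top = sorted(rank_sentences(sentence_concepts, graph)[:n])
    return [sentences[i] for i in top]


# Toy input: the concept labels are illustrative placeholders.
sentences = [
    "BRCA1 participates in DNA repair.",
    "BRCA1 mutations are linked to breast cancer and impaired DNA repair.",
    "The study was funded by a grant.",
]
sentence_concepts = [
    ["BRCA1", "DNA repair"],
    ["BRCA1", "breast cancer", "DNA repair"],
    [],
]

print(summarize(sentences, sentence_concepts, n=1))
```

In this sketch, swapping the knowledge source corresponds to changing which labels appear in `sentence_concepts`: a narrower vocabulary yields a smaller graph and can change which sentences rank highest.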
