Clustering cliques for graph-based summarization of the biomedical research literature

Background: Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts).

Results: SemRep was used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications with highly connected arguments filtered by degree centrality. Themes contained in the summary were identified with a hierarchical clustering algorithm based on common arguments shared among cliques. The validity of the clusters in the summaries produced was compared to the Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced with major MeSH headings.

Conclusions: For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10 percentage points higher than the baseline (43% versus 33%). Compared to the reference standard from MeSH headings, recall, precision and F-score were 0.64, 0.65, and 0.65, respectively.
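The general idea of clustering cliques by shared arguments can be illustrated with a minimal sketch. This is not the authors' implementation: the toy predications, the Jaccard distance between cliques, the average-linkage merge rule, and the `min_size` and `max_distance` parameters are all assumptions chosen for illustration, and the frequency and degree-centrality filtering steps are omitted.

```python
from itertools import combinations

# Hypothetical (subject, predicate, object) predications, not SemRep output.
PREDICATIONS = [
    ("aspirin", "TREATS", "stroke"),
    ("warfarin", "TREATS", "stroke"),
    ("aspirin", "INTERACTS_WITH", "warfarin"),
    ("hypertension", "PREDISPOSES", "stroke"),
    ("aspirin", "AFFECTS", "hypertension"),
    ("metformin", "TREATS", "diabetes"),
    ("insulin", "TREATS", "diabetes"),
    ("metformin", "INTERACTS_WITH", "insulin"),
    ("diabetes", "PREDISPOSES", "stroke"),
]

def build_graph(predications):
    """Undirected graph over arguments; each predication contributes one edge."""
    graph = {}
    for subj, _pred, obj in predications:
        if subj == obj:
            continue  # skip self-loops, which would break clique enumeration
        graph.setdefault(subj, set()).add(obj)
        graph.setdefault(obj, set()).add(subj)
    return graph

def maximal_cliques(graph, min_size=3):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return found

def jaccard_distance(a, b):
    """1 minus the fraction of arguments two cliques share."""
    return 1.0 - len(a & b) / len(a | b)

def cluster_cliques(cliques, max_distance=0.7):
    """Average-linkage agglomerative clustering, stopped at max_distance."""
    clusters = [[c] for c in cliques]

    def linkage(c1, c2):
        total = sum(jaccard_distance(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

cliques = maximal_cliques(build_graph(PREDICATIONS))
themes = [frozenset().union(*cluster) for cluster in cluster_cliques(cliques)]
for theme in themes:
    print(sorted(theme))
```

On this toy input, the two cliques that share `aspirin` and `stroke` merge into one theme, while the diabetes clique stays separate because it shares no argument with them. The lone `diabetes`-`stroke` edge forms only a two-node clique and is dropped by `min_size`.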
