PuReD-MCL: a graph-based PubMed document clustering methodology

MOTIVATION Biomedical literature is the principal repository of biomedical knowledge, and PubMed is the most complete database collecting, organizing and analyzing such textual knowledge. Numerous efforts attempt to exploit this information using text mining and machine learning techniques. We developed a novel approach, called PuReD-MCL (Pubmed Related Documents-MCL), which is based on the graph clustering algorithm MCL and relevant resources from PubMed.

METHODS PuReD-MCL avoids using natural language processing (NLP) techniques directly; instead, it takes advantage of existing resources available from PubMed. PuReD-MCL then clusters documents efficiently using the MCL graph clustering algorithm, which is based on graph flow simulation. Users can then analyse the results by highlighting important clues, and finally visualize the clusters and all relevant information using an interactive graph layout tool, for instance BioLayout Express 3D.

RESULTS The methodology was applied to two different datasets previously used for the validation of the document clustering tool TextQuest. The first dataset involves the organisms Escherichia coli and yeast, whereas the second is related to Drosophila development. PuReD-MCL successfully reproduces the annotated results obtained from TextQuest, while also providing additional insights into the clusters and the corresponding documents.

AVAILABILITY Source code in Perl and R is available from http://tartara.csd.auth.gr/~theodos/
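
To make the flow-simulation idea in METHODS concrete, the following is a minimal sketch of the generic MCL iteration (expansion followed by inflation) in plain Python. It is not the authors' Perl/R code. The function names, the inflation value of 2.0, the added self-loops, and the toy document graph are all assumptions chosen for illustration; in PuReD-MCL the edges would come from PubMed resources rather than a hand-written matrix.

```python
def normalize_columns(m):
    """Rescale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out


def expand(m):
    """Expansion: square the matrix, i.e. take two steps of the random walk."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph given as a square adjacency matrix.

    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(adjacency)
    # Assumption: add self-loops, a common convention for MCL inputs.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy graph: six documents forming two groups of three,
# joined by a single weak link between documents 2 and 3.
bridged = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

if __name__ == "__main__":
    for cluster in mcl(bridged):
        print(cluster)
```

A higher inflation value generally yields finer clusters and a lower one coarser clusters, so it is the main parameter to tune in a sketch like this.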
