A PubMed-Wide Associational Study of Infectious Diseases

Background: Computational discovery is playing an ever-greater role in supporting knowledge synthesis. A significant proportion of the more than 18 million manuscripts indexed in the PubMed database describe infectious disease syndromes and various infectious agents. This study is the first attempt to integrate online repositories of text-based publications with microbial genome databases in order to explore the dynamics of relationships between pathogens and infectious diseases.

Methodology/Principal Findings: We demonstrate how the knowledge space of infectious diseases can be computationally represented, quantified, and tracked over time. The knowledge space is explored by mapping the infectious disease literature, examining the dynamics of literature deposition, zooming in from the pathogen level to the genome level, and searching for new associations. Syndromic signatures for different pathogens can be created to enable a new and clinically focussed reclassification of the microbial world. Examples of syndrome and pathogen networks illustrate how multilevel network representations of the relationships between infectious syndromes, pathogens and pathogen genomes can illuminate unexpected biological similarities in disease pathogenesis and epidemiology.

Conclusions/Significance: This new approach, based on text and data mining, can support the discovery of previously hidden associations between diseases and microbial pathogens and the clinically relevant reclassification of pathogenic microorganisms, and can accelerate the translational research enterprise.
