dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text

Background
Discerning the genetic contributions to complex human diseases is a challenging task that demands new types of data and new computational approaches to uncovering disease etiology. Systems approaches to studying observable phenotypic relationships among diseases are emerging as an active area of research for both novel disease gene discovery and drug repositioning. Currently, systematic study of disease relationships on a phenome-wide scale is limited by the lack of large-scale, machine-understandable knowledge bases of disease phenotype relationships. Our study introduces a semi-supervised iterative pattern learning approach and uses it to build a precise, large-scale disease-disease risk relationship (D1 → D2) knowledge base (dRiskKB) from a vast corpus of free-text published biomedical literature.

Results
The text corpus comprised 21,354,075 MEDLINE records. First, we used one typical disease risk-specific syntactic pattern (i.e. "D1 due to D2") as a seed to automatically discover other patterns specifying similar semantic relationships among diseases. We then extracted D1 → D2 risk pairs from MEDLINE using the learned patterns. We manually evaluated the precision of the learned patterns and of the extracted pairs. Finally, we analyzed the correlations between disease-disease risk pairs and their associated genes and drugs. The newly created dRiskKB consists of a total of 34,448 unique D1 → D2 pairs, representing the risk-specific semantic relationships among 12,981 diseases, with each disease linked to its associated genes and drugs. The identified patterns are highly precise (average precision of 0.99) in specifying the risk-specific relationships among diseases. The precision of the extracted pairs is 0.919 for those that are exactly matched and 0.988 for those that are partially matched. By comparing runs of the iterative pattern learning approach started from different seeds, we demonstrated that our algorithm is robust to the choice of seed. We show that diseases and their risk diseases, as well as diseases with similar risk profiles, tend to share both genes and drugs.

Conclusions
This unique dRiskKB, when combined with existing phenotypic, genetic, and genomic datasets, can have profound implications for our understanding of disease etiology and for drug repositioning.
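
The seed-based pattern learning summarized in the Results can be illustrated with a small sketch. It is not the authors' implementation: the toy corpus, the disease lexicon, the function names, and the `min_support` threshold are all hypothetical, and patterns are approximated here as the plain text found between two neighbouring disease mentions, whereas the study works with syntactic patterns over MEDLINE. Starting from the seed "D1 due to D2", the loop alternates between extracting pairs with the known patterns and collecting new patterns that connect already-extracted pairs.

```python
import re
from collections import Counter

# Hypothetical toy lexicon and corpus, for illustration only.
DISEASES = ["iron deficiency", "anemia", "blindness",
            "diabetes", "stroke", "hypertension"]
CORPUS = [
    "Anemia due to iron deficiency is common.",
    "Blindness due to diabetes was reported.",
    "Anemia caused by iron deficiency was treated.",
    "Blindness secondary to diabetes is preventable.",
    "Stroke caused by hypertension remains frequent.",
    "Stroke and diabetes were both recorded.",
]

# Longest names first so multi-word names win over shorter alternatives.
_names = sorted(DISEASES, key=len, reverse=True)
MENTION = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in _names) + r")\b")


def adjacent_mentions(sentence):
    """Yield (d1, infix, d2) for each pair of neighbouring disease mentions."""
    text = sentence.lower()
    hits = list(MENTION.finditer(text))
    for left, right in zip(hits, hits[1:]):
        infix = text[left.end():right.start()].strip()
        yield left.group(), infix, right.group()


def extract_pairs(corpus, patterns):
    """Return ordered (D1, D2) pairs joined by any known pattern."""
    return {(d1, d2)
            for sentence in corpus
            for d1, infix, d2 in adjacent_mentions(sentence)
            if infix in patterns}


def learn_patterns(corpus, pairs, min_support=1):
    """Return infixes that connect already-known pairs (assumed threshold)."""
    support = Counter()
    for sentence in corpus:
        for d1, infix, d2 in adjacent_mentions(sentence):
            if infix and (d1, d2) in pairs:
                support[infix] += 1
    return {p for p, n in support.items() if n >= min_support}


def bootstrap(corpus, seed="due to", rounds=2):
    """Alternate pair extraction and pattern learning, starting from one seed."""
    patterns, pairs = {seed}, set()
    for _ in range(rounds):
        pairs |= extract_pairs(corpus, patterns)
        patterns |= learn_patterns(corpus, pairs)
    return patterns, pairs


if __name__ == "__main__":
    learned_patterns, learned_pairs = bootstrap(CORPUS)
    print(sorted(learned_patterns))
    print(sorted(learned_pairs))
```

On this toy corpus the seed yields two pairs, those pairs surface the infixes "caused by" and "secondary to", and the second round uses "caused by" to pick up a pair the seed alone never matched. The coordinating "and" is not learned because it never connects a known pair. A real system would also need pattern ranking and manual review of the kind the abstract reports.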
