KnowLife: a versatile approach for constructing a large knowledge graph for biomedical sciences

Background
Biomedical knowledge bases (KBs) have become important assets in the life sciences. Prior work on KB construction has three major limitations. First, most biomedical KBs are manually built and curated, and cannot keep up with the rate at which new findings are published. Second, for automatic information extraction (IE), the text genre of choice has been scientific publications, neglecting sources like health portals and online communities. Third, most prior work on IE has focused only on the molecular level or chemogenomics, such as protein-protein interactions or gene-drug relationships, or has addressed only highly specific topics such as drug effects.

Results
We address these three limitations with a versatile and scalable approach to automatic KB construction. Using a small number of seed facts for distant supervision of pattern-based extraction, we harvest a huge number of facts in an automated manner without requiring any explicit training.

We extend previous techniques for pattern-based IE with confidence statistics, and we combine this recall-oriented stage with logical reasoning for consistency constraint checking to achieve high precision. To our knowledge, this is the first method that uses consistency checking for biomedical relations. Our approach can be easily extended to incorporate additional relations and constraints.

We ran extensive experiments not only on scientific publications, but also on encyclopedic health portals and online communities, creating different KBs based on different configurations. We assess the size and quality of each KB in terms of number of facts and precision. The best configured KB, KnowLife, contains more than 500,000 facts at a precision of 93% for 13 relations covering genes, organs, diseases, symptoms, treatments, as well as environmental and lifestyle risk factors.

Conclusion
KnowLife is a large knowledge base for health and life sciences, automatically constructed from different Web sources. As a unique feature, KnowLife is harvested from different text genres such as scientific publications, health portals, and online communities. Thus, it has the potential to serve as a one-stop portal for a wide range of relations and use cases. To showcase its breadth and usefulness, we make the KnowLife KB accessible through a health portal (http://knowlife.mpi-inf.mpg.de).
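The two-stage idea summarized in the Results (seed facts score textual patterns, confident patterns propose new facts, and a consistency constraint prunes conflicts) can be illustrated with a toy sketch. This is not the KnowLife implementation: the corpus, the relation names `treats` and `isRiskFactorOf`, the confidence formula (share of a pattern's entity pairs that are seed facts), the 0.5 threshold, and the greedy mutual-exclusion rule are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Toy input (hypothetical): (left entity, connecting phrase, right entity).
OCCURRENCES = [
    ("aspirin", "is used to treat", "headache"),
    ("ibuprofen", "is used to treat", "fever"),
    ("paracetamol", "is used to treat", "fever"),
    ("metformin", "is used to treat", "diabetes"),
    ("smoking", "increases the risk of", "lung cancer"),
    ("obesity", "increases the risk of", "diabetes"),
    ("metformin", "increases the risk of", "diabetes"),  # noisy sentence
    ("aspirin", "was mentioned with", "headache"),
    ("coffee", "was mentioned with", "insomnia"),
    ("tea", "was mentioned with", "sleep"),
]

# Seed facts per relation (relation names are illustrative assumptions).
SEEDS = {
    "treats": {("aspirin", "headache"), ("ibuprofen", "fever"),
               ("paracetamol", "fever")},
    "isRiskFactorOf": {("smoking", "lung cancer"), ("obesity", "diabetes")},
}

# Assumed constraint: one entity pair may not hold both relations.
MUTUALLY_EXCLUSIVE = [("treats", "isRiskFactorOf")]


def pattern_confidence(occurrences, seeds):
    """Score each (phrase, relation) by the share of its pairs that are seeds."""
    pairs_by_phrase = defaultdict(set)
    for left, phrase, right in occurrences:
        pairs_by_phrase[phrase].add((left, right))
    conf = {}
    for phrase, pairs in pairs_by_phrase.items():
        for relation, seed_pairs in seeds.items():
            hits = len(pairs & seed_pairs)
            if hits:
                conf[(phrase, relation)] = hits / len(pairs)
    return conf


def candidate_facts(occurrences, conf, min_conf=0.5):
    """Recall-oriented stage: every confident pattern proposes facts."""
    best = {}
    for left, phrase, right in occurrences:
        for (p, relation), c in conf.items():
            if p == phrase and c >= min_conf:
                key = (relation, left, right)
                best[key] = max(best.get(key, 0.0), c)
    return best


def enforce_exclusion(facts, exclusive):
    """Precision stage: of two conflicting facts, keep the more confident.

    A greedy stand-in for logical consistency reasoning; on a tie the
    fact of the first relation in the pair is dropped.
    """
    kept = dict(facts)
    for rel_a, rel_b in exclusive:
        for (relation, left, right), c in facts.items():
            if relation != rel_a:
                continue
            other = (rel_b, left, right)
            if other in facts:
                loser = other if facts[other] < c else (relation, left, right)
                kept.pop(loser, None)
    return kept


conf = pattern_confidence(OCCURRENCES, SEEDS)
facts = candidate_facts(OCCURRENCES, conf)
kb = enforce_exclusion(facts, MUTUALLY_EXCLUSIVE)
```

In this toy run the weak pattern "was mentioned with" falls below the threshold, so its pairs are never proposed, and the conflicting claim that metformin is a risk factor for diabetes is dropped in favor of the more confident `treats` fact.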
