PhenoMiner: from text to a database of phenotypes associated with OMIM diseases

Scientific and clinical phenotypes reported in the experimental literature have been curated manually to build high-quality databases such as the Online Mendelian Inheritance in Man (OMIM). However, the identification and harmonization of phenotype descriptions struggle with the diversity of human expressivity. We introduce a novel automated extraction approach called PhenoMiner that exploits full parsing and conceptual analysis. Apriori association mining is then used to identify relationships to human diseases. We applied PhenoMiner to the BMC open access collection and identified 13 636 phenotype candidates. We identified 28 155 phenotype-disorder hypotheses covering 4898 phenotypes and 1659 Mendelian disorders. Analysis showed: (i) the semantic distribution of the extracted terms against linked ontologies; (ii) a comparison of term overlap with the Human Phenotype Ontology (HP); (iii) moderate support for phenotype-disorder pairs in both OMIM and the literature; (iv) strong associations of phenotype-disorder pairs to known disease-gene pairs using PhenoDigm. The full list of PhenoMiner phenotypes (S1), phenotype-disorder associations (S2), association-filtered linked data (S3) and user database documentation (S5) are available as supplementary data and can be downloaded at http://github.com/nhcollier/PhenoMiner under a Creative Commons Attribution 4.0 license. Database URL: phenominer.mml.cam.ac.uk
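To make the association-mining step concrete, the following is a minimal sketch of pairwise support and confidence scoring, which is the simplest (two-item) case of Apriori-style rule mining. It is not the PhenoMiner pipeline: the function name `mine_pairs`, the thresholds, the input layout (one set of phenotype mentions and one set of disorder mentions per document) and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def mine_pairs(docs, min_support=2, min_confidence=0.5):
    """Score phenotype -> disorder pairs by co-occurrence (hypothetical helper).

    docs: list of (phenotypes, disorders) pairs, one per document,
          where each element is a set of strings.
    Returns (phenotype, disorder, support, confidence) tuples.
    """
    pheno_count = Counter()  # number of documents mentioning each phenotype
    pair_count = Counter()   # number of documents mentioning both items
    for phenotypes, disorders in docs:
        pheno_count.update(set(phenotypes))
        pair_count.update(product(set(phenotypes), set(disorders)))

    rules = []
    for (pheno, disorder), support in pair_count.items():
        # confidence = P(disorder | phenotype), estimated from document counts
        confidence = support / pheno_count[pheno]
        if support >= min_support and confidence >= min_confidence:
            rules.append((pheno, disorder, support, confidence))
    # Highest support first, then highest confidence, then alphabetical
    return sorted(rules, key=lambda r: (-r[2], -r[3], r[0], r[1]))


# Toy input, invented for illustration only (not taken from the paper's data)
docs = [
    ({"short stature", "scoliosis"}, {"Marfan syndrome"}),
    ({"scoliosis"}, {"Marfan syndrome"}),
    ({"short stature"}, {"achondroplasia"}),
    ({"short stature"}, {"achondroplasia"}),
]
rules = mine_pairs(docs)
for pheno, disorder, support, confidence in rules:
    print(f"{pheno} -> {disorder}: support={support}, confidence={confidence:.2f}")
```

A full Apriori implementation would also grow larger itemsets level by level and prune candidates whose subsets are infrequent; the sketch stops at pairs because phenotype-disorder hypotheses are pairs.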
