Mining Skeletal Phenotype Descriptions from Scientific Literature

Phenotype descriptions are important for our understanding of genetics, as they enable the computation and analysis of a wide range of questions related to the genetic and developmental bases of correlated characters. The literature contains a wealth of such phenotype descriptions, usually reported as free text, similar to typical clinical summaries. In this paper, we create and make available an annotated corpus of skeletal phenotype descriptions. In addition, we present and evaluate a hybrid machine learning approach for mining phenotype descriptions from free text. The approach combines an ensemble of four classifiers, and we experiment with several techniques for aggregating their outputs. The best-scoring technique achieves an F1 score of 71.52%, which is close to the state of the art in other domains, where training data exists in abundance. Finally, we discuss the influence of the features chosen for the model on the overall performance of the method.
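As an illustration of what aggregating an ensemble can look like, the following minimal sketch combines per-token labels from four classifiers by majority vote and scores the result with a token-level F1. It is not the method evaluated in the paper: the label names (`PHEN`, `O`), the tie-breaking rule, the token-level scoring, and the function names `majority_vote` and `f1_score` are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(predictions, tie_break="O"):
    """Merge per-token labels from several classifiers by majority vote.

    predictions: list of label sequences, one per classifier, all of the
    same length. Ties fall back to `tie_break` (an assumption of this sketch).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all classifiers must label the same tokens")

    merged = []
    for labels in zip(*predictions):
        counts = Counter(labels).most_common()
        top_label, top_count = counts[0]
        # Collect every label that shares the highest vote count.
        tied = [label for label, count in counts if count == top_count]
        merged.append(top_label if len(tied) == 1 else tie_break)
    return merged


def f1_score(gold, predicted, positive="PHEN"):
    """Token-level F1 for one positive label (hypothetical label name)."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up outputs of four classifiers over three tokens.
predictions = [
    ["PHEN", "O", "PHEN"],
    ["PHEN", "O", "O"],
    ["O", "O", "PHEN"],
    ["PHEN", "PHEN", "O"],
]
gold = ["PHEN", "O", "PHEN"]

merged = majority_vote(predictions)
score = f1_score(gold, merged)
```

Majority voting is only one possible aggregation rule; weighted voting or taking the union or intersection of predicted spans are common alternatives, and the choice of tie-breaking rule affects the balance between precision and recall.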
