Desiderata for ontologies to be used in semantic annotation of biomedical documents

A wealth of knowledge valuable to the translational research scientist is contained within the vast biomedical literature, but this knowledge is typically in the form of natural language. Sophisticated natural-language-processing systems are needed to translate text into unambiguous formal representations grounded in high-quality consensus ontologies, and these systems in turn rely on gold-standard corpora of annotated documents for training and testing. To this end, we are constructing the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-text biomedical journal articles that are being manually annotated with the entire sets of terms from select vocabularies, predominantly from the Open Biomedical Ontologies (OBO) library. Our efforts in building this corpus have illuminated infelicities of these ontologies with respect to the semantic annotation of biomedical documents, and we propose desiderata whose implementation could substantially improve their utility in this task. These include the integration of overlapping terms across OBOs, the resolution of OBO-specific ambiguities, the integration of the BFO with the OBOs and the use of mid-level ontologies, the inclusion of noncanonical instances, and the expansion of relations and realizable entities.
