GO-WORDS: An Entropic Approach to Semantic Decomposition of Gene Ontology Terms

ABSTRACT

The Gene Ontology (GO) has a large and growing number of terms that constitute its vocabulary. An entropy-based approach is presented to automate the characterization of the compositional semantics of GO terms. The motivation is to extend the machine-readability of GO and to offer insights for the continued maintenance and growth of GO. A prototype implementation illustrates the benefits of the approach.

1 INTRODUCTION

The underlying motivation of the work described in this paper is to map annotations based on the Gene Ontology (GO) (Ashburner, et al., 2000) to a semantic representation that exposes the internal semantics of GO terms to computer programs. The Gene Ontology (GO) views each gene product as being a structural component of a biological entity, being involved in a biological process, and as having a molecular function. These three dimensions of component (C), process (P) and function (F) are hierarchically refined into several thousand subconcepts or GO terms for a fine-grained description of gene products, and ultimately a representation of collective biological knowledge.

The machine-readability of GO is based on explicit IS-A or PART-OF relations between different GO terms (Fig. 1). The representation of each GO term in terms of a phrase in English is primarily meant for human readability, and not machine-readability (Wroe, et al., 2003) (Fig. 1). For example, while both humans and computer programs can understand that ‘Folic Acid Transporter Activity’ is one kind of ‘Vitamin Transporter Activity,’ only a human reader can appreciate that proteins annotated with ‘Folic Acid Transporter Activity’ actually

[1] Robert Stevens, et al. Using reasoning to guide annotation with gene ontology terms in GOAT, 2004, SGMD.

[2] Christopher J. Mungall, et al. Obol: Integrating Language and Meaning in Bio-Ontologies, 2004, Comparative and functional genomics.

[3] K. Bretonnel Cohen, et al. The Compositional Structure of Gene Ontology Terms, 2003, Pacific Symposium on Biocomputing.

[4] Carole A. Goble, et al. A Methodology to Migrate the Gene Ontology to a Description Logic Environment Using DAML+OIL, 2002, Pacific Symposium on Biocomputing.

[5] Mário J. Silva, et al. Finding genomic ontology terms in text using evidence content, 2005, BMC Bioinformatics.

[6] M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000, Nature Genetics.

[7] Zhiyong Lu, et al. Evaluation of Lexical Methods for Detecting Relationships Between Concepts from Multiple Ontologies, 2006, Pacific Symposium on Biocomputing.

[8] Alfonso Valencia, et al. Evaluation of BioCreAtIvE assessment of task 2, 2005, BMC Bioinformatics.

[9] Tim Furche, et al. How to Query the GeneOntology, 2005.

[10] Jong C. Park, et al. Automatic extension of Gene Ontology with flexible identification of candidate terms, 2006, Bioinform.

[11] Yugyung Lee, et al. Model Formulation: MachineProse: An Ontological Framework for Scientific Assertions, 2006, J. Am. Medical Informatics Assoc.

[12] Claude E. Shannon, et al. Prediction and Entropy of Printed English, 1951.

[13] Michael Schroeder, et al. GoPubMed: exploring PubMed with the Gene Ontology, 2005, Nucleic Acids Res.