LEARNING WITH GENE ONTOLOGY ANNOTATION USING FEATURE SELECTION AND CONSTRUCTION

A key role of ontologies in bioinformatics is to provide a standardized, structured terminology, particularly for annotating the genes in a genome with functional and other properties. Because many genome-scale experiments produce gene sets, it is natural to ask whether the genes in a set share a common function. A standard approach is to apply a statistical test for overrepresentation of functional annotation, often using the Gene Ontology. In this article we propose an alternative that avoids the problems in overrepresentation analysis caused by statistical dependencies between ontology categories. We apply feature construction and selection methods to preprocess the Gene Ontology terms used to annotate gene sets, and we supply the resulting features as input to a standard supervised machine-learning algorithm. We show that this approach allows an ontology to be used straightforwardly with data sourced from multiple experiments, to learn classifiers that predict gene function as part of a cellular response to environmental stress.
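For readers unfamiliar with the standard approach mentioned above, the following is a minimal sketch of an overrepresentation test for a single ontology term. The abstract does not say which statistical test is meant, so the one-sided hypergeometric test used here is an assumption, chosen because it is a common choice for this kind of analysis. The function name, parameter names, and gene counts are hypothetical illustrations, not taken from the article.

```python
from math import comb


def overrepresentation_p_value(population_size, annotated_in_population,
                               sample_size, annotated_in_sample):
    """Upper-tail hypergeometric probability P(X >= annotated_in_sample).

    X is the number of genes carrying a given annotation term in a gene set
    of `sample_size` genes drawn without replacement from a population of
    `population_size` genes, of which `annotated_in_population` carry the term.
    All names here are hypothetical; the test itself is an assumed choice.
    """
    # X cannot exceed either the number of annotated genes or the set size.
    upper = min(annotated_in_population, sample_size)
    # Number of ways to draw any gene set of this size.
    total = comb(population_size, sample_size)
    # Count the draws that contain at least `annotated_in_sample` annotated genes.
    tail = sum(
        comb(annotated_in_population, k)
        * comb(population_size - annotated_in_population, sample_size - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / total


# Toy numbers (invented): 20 genes in total, 5 carry the term,
# and a gene set of 4 genes contains 3 of them.
p = overrepresentation_p_value(20, 5, 4, 3)
print(f"p-value: {p:.4f}")
```

A test of this kind treats each term separately. Terms in an ontology are related to one another, so their tests are not independent, which is the dependency problem the proposed feature-based approach is meant to avoid.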
