Classifying genes to the correct Gene Ontology Slim term in Saccharomyces cerevisiae using neighbouring genes with classification learning

Background

There is increasing evidence that a gene's location and its surrounding genes influence its function in the eukaryotic genome. The Gene Ontology Slim terms associated with a gene give insight into that function by describing how the gene product behaves in a cellular context under three ontologies: molecular function, biological process, and cellular component. In this study, we examined whether classification learning could assign a gene in Saccharomyces cerevisiae to its correct Gene Ontology Slim term using information about its location in the genome and information from its nearest-neighbouring genes.

Results

We performed experiments to establish that the MultiBoostAB algorithm with the J48 classifier could correctly classify the Gene Ontology Slim terms of a gene when trained on the gene's location and information from its nearest-neighbouring genes. Different neighbourhood sizes were examined to determine how many nearest neighbours should be included around each gene to produce better classification rules. Our results show that incorporating information from just each gene's two nearest neighbours raises the percentage of genes classified to their correct Gene Ontology Slim term above 80% for each ontology, and that the classification rules produced are accurate, with F-measures above 0.80.

Conclusions

We confirmed that including information from neighbouring genes is beneficial when classifying genes to their correct Gene Ontology Slim term. Knowing the location of a gene and the Gene Ontology Slim information of its neighbouring genes gives insight into that gene's functionality. This benefit is already seen when only a gene's two nearest neighbouring genes are included.
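To make the idea of neighbour information concrete, the following is a minimal sketch of how a per-gene feature row might be assembled from the nearest genes on the same chromosome. It is not the authors' pipeline. The `Gene` record, the `neighbour_features` function, the single `go_slim` label per gene, and the choice of taking `k` genes on each side by start coordinate are all assumptions made for illustration; the abstract does not specify the feature encoding.

```python
from typing import Dict, List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; field names are illustrative only."""
    name: str
    chromosome: str
    start: int       # start coordinate on the chromosome
    go_slim: str     # assumed: one GO Slim label per gene


def neighbour_features(genes: List[Gene], k: int = 1) -> Dict[str, Dict[str, object]]:
    """Build one feature row per gene from its k nearest neighbours on each side.

    Neighbours are taken by order along the same chromosome (an assumption).
    Missing neighbours at chromosome ends are recorded as None.
    """
    # Group genes by chromosome so neighbours never cross chromosomes.
    by_chrom: Dict[str, List[Gene]] = {}
    for gene in genes:
        by_chrom.setdefault(gene.chromosome, []).append(gene)

    features: Dict[str, Dict[str, object]] = {}
    for chrom, members in by_chrom.items():
        ordered = sorted(members, key=lambda g: g.start)
        for i, gene in enumerate(ordered):
            row: Dict[str, object] = {"chromosome": chrom, "start": gene.start}
            for offset in range(1, k + 1):
                left, right = i - offset, i + offset
                row[f"left_{offset}"] = ordered[left].go_slim if left >= 0 else None
                row[f"right_{offset}"] = (
                    ordered[right].go_slim if right < len(ordered) else None
                )
            features[gene.name] = row
    return features


if __name__ == "__main__":
    # Made-up genes and labels, purely for demonstration.
    demo = [
        Gene("geneA", "I", 100, "transport"),
        Gene("geneB", "I", 300, "translation"),
        Gene("geneC", "I", 200, "transport"),
    ]
    for name, row in neighbour_features(demo, k=1).items():
        print(name, row)
```

Rows of this shape, with the gene's own GO Slim term as the class label, could then be passed to a boosted decision-tree learner of the kind the study reports; the classifier itself is not shown here.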
