Information theory applied to the sparse gene ontology annotation network to predict novel gene function

MOTIVATION Despite advances in the gene annotation process, the functions of a large portion of gene products remain insufficiently characterized. In addition, the in silico prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomic approaches. To our knowledge, no prediction method has been demonstrated to be highly accurate for sparsely annotated GO terms (those associated to fewer than 10 genes). RESULTS We propose a novel approach, information theory-based semantic similarity (ITSS), to automatically predict molecular functions of genes based on existing GO annotations. Using a 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (precision 97%, recall 77%) comparable to other machine learning algorithms when compared in similar conditions over densely annotated portions of the GO datasets. This method is able to generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed. As a result, our technique generates an order of magnitude more functional predictions than previous methods. A 10-fold cross validation demonstrated a precision of 90% at a recall of 36% for the algorithm over sparsely annotated networks of the recent GO annotations (about 1400 GO terms and 11,000 genes in Homo sapiens). To our knowledge, this article presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions than more widely used cross-validation approaches. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43-58%) can be achieved for the human GO Annotation file dated 2003. AVAILABILITY The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset and other supplementary information is available at http://phenos.bsd.uchicago.edu/ITSS/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

[1]  Jung-Hsien Chiang,et al.  GeneLibrarian: an effective gene-information summarization and visualization system , 2006, BMC Bioinformatics.

[2]  S. Dwight,et al.  Predicting gene function from patterns of annotation. , 2003, Genome research.

[3]  Janet M Thornton,et al.  Protein function prediction using local 3D templates. , 2005, Journal of molecular biology.

[4]  Babak Shahbaba,et al.  Gene function classification using Bayesian models with hierarchy-based priors , 2006, BMC Bioinformatics.

[5]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[6]  Tero Aittokallio,et al.  Improving missing value estimation in microarray data with gene ontology , 2006, Bioinform..

[7]  Olivier Bodenreider,et al.  An ontology-driven clustering method for supporting gene expression analysis , 2005, 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05).

[8]  Olivier Bodenreider,et al.  Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships , 2004, 2004 Symposium on Computational Intelligence in Bioinformatics and Computational Biology.

[9]  J. Hanley,et al.  A method of comparing the areas under receiver operating characteristic curves derived from the same cases. , 1983, Radiology.

[10]  F. James Rohlf,et al.  Biometry: The Principles and Practice of Statistics in Biological Research , 1969 .

[11]  J. Jiang,et al.  Multi-word complex concept retrieval via lexical semantic similarity , 1999, Proceedings 1999 International Conference on Information Intelligence and Systems (Cat. No.PR00446).

[12]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[13]  Carole A. Goble,et al.  Semantic Similarity Measures as Tools for Exploring the Gene Ontology , 2002, Pacific Symposium on Biocomputing.

[14]  B. Aggarwal,et al.  VEGI, a new member of the TNF family activates Nuclear Factor-κB and c-Jun N-terminal kinase and modulates cell growth , 1999, Oncogene.

[15]  David Botstein,et al.  SGD: Saccharomyces Genome Database , 1998, Nucleic Acids Res..

[16]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[17]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[18]  Robert Stevens,et al.  Using reasoning to guide annotation with gene ontology terms in GOAT , 2004, SGMD.

[19]  A T McCray,et al.  From Phenotype to Genotype: Issues in Navigating the Available Information Resources , 2003, Methods of Information in Medicine.

[20]  Dong Xu,et al.  Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae. , 2004, Nucleic acids research.

[21]  C. Metz Basic principles of ROC analysis. , 1978, Seminars in nuclear medicine.

[22]  Xiaomei Wu,et al.  Prediction of yeast protein–protein interaction network: insights from the Gene Ontology and annotations , 2006, Nucleic acids research.

[23]  Miguel A. Andrade-Navarro,et al.  Gene annotation from scientific literature using mappings between keyword systems , 2004, Bioinform..

[24]  Byoung-Tak Zhang,et al.  Collocation Dictionary Optimization Using WordNet and k-Nearest Neighbor Learning , 2001, Machine Translation.

[25]  Miguel A. Andrade-Navarro,et al.  Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families , 1998, Bioinform..

[26]  Myoung-Ho Kim,et al.  Using term dependencies of a thesaurus in the fuzzy set model , 1993, Microprocess. Microprogramming.

[27]  Andrey Rzhetsky,et al.  Microparadigms: chains of collective reasoning in publications about molecular interactions. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[28]  Yves A. Lussier,et al.  Evaluation of high-throughput functional categorization of human disease genes , 2007, BMC Bioinformatics.

[29]  Frank Holstege,et al.  Predicting gene function through systematic analysis and quality assessment of high-throughput data , 2005, Bioinform..

[30]  T. Uchiyama,et al.  Molecular cloning, genomic structure, chromosomal localization, and alternative splice forms of the platelet collagen receptor glycoprotein VI. , 2000, Biochemical and biophysical research communications.

[31]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[32]  Ute Baumann,et al.  BMC Bioinformatics BioMed Central Methodology article Automated methods of predicting the function of biological sequences using GO and BLAST , 2005 .

[33]  Carole A. Goble,et al.  Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation , 2003, Bioinform..

[34]  Carl J. Schmidt,et al.  GoFigure: Automated Gene OntologyTM annotation , 2003, Bioinform..