Association rule mining of gene ontology annotation terms for SGD

The Gene Ontology (GO) is one of the largest bioinformatics projects; it seeks to consolidate knowledge about genes by annotating them with terms from three ontologies. In this work, we present a technique for finding association relationships among the annotation terms for the Saccharomyces cerevisiae genome, as curated in the Saccharomyces Genome Database (SGD). We first present a normalization algorithm that ensures the annotation terms have a similar level of specificity. Association rule mining algorithms are then used to find significant and non-trivial association rules in these normalized datasets. Metrics such as support, confidence, and lift can be used to evaluate the strength of the rules found. We conducted experiments on the entire SGD annotation dataset and present the 10 strongest rules for each of the three ontologies. We verify these rules using evidence from the biomedical literature. The presented method has several advantages: it relies only on the structure of the Gene Ontology, has minimal memory and storage requirements, and scales easily to large genomes such as the human genome. The technique has many applications, such as predicting GO annotations for new genes or for genes that have not been studied extensively.
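
To make the three rule-strength metrics concrete, the following is a minimal sketch of how support, confidence, and lift could be computed for a single rule. It uses the standard textbook definitions and is not the implementation used in this work. The function name `rule_metrics`, the toy gene annotation sets, and the placeholder labels `term_A`, `term_B`, and `term_C` are all hypothetical; they are not real GO identifiers or SGD data.

```python
from typing import Dict, Iterable, List, Set


def rule_metrics(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> Dict[str, float]:
    """Compute support, confidence, and lift for the rule antecedent -> consequent.

    Each transaction is the set of annotation terms attached to one gene.
    """
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)
    if n == 0:
        raise ValueError("at least one transaction is required")

    # Count genes annotated with the antecedent, the consequent, and both.
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)

    support = count_ac / n                                   # P(A and C)
    confidence = count_ac / count_a if count_a else 0.0      # P(C | A)
    lift = confidence / (count_c / n) if count_c else 0.0    # P(C | A) / P(C)
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical toy data: four genes with placeholder annotation terms.
toy_annotations = [
    {"term_A", "term_B"},
    {"term_A", "term_B", "term_C"},
    {"term_A"},
    {"term_C"},
]

metrics = rule_metrics(toy_annotations, {"term_A"}, {"term_B"})
print(metrics)
```

In this toy example, a lift above 1 would indicate that genes carrying `term_A` are annotated with `term_B` more often than expected if the two terms were independent, which is the sense in which a rule counts as non-trivial.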
