Onto-CC: a web server for identifying Gene Ontology conceptual clusters

The Gene Ontology (GO) vocabulary has been extensively explored to analyze the functions of coexpressed genes. However, despite its extended use in Biology and Medical Sciences, there are still high levels of uncertainty about which ontology (i.e. Molecular Process, Cellular Component or Molecular Function) should be used, and at which level of specificity. Moreover, the GO database can contain incomplete information resulting from human annotations, or highly influenced by the available knowledge about a specific branch in an ontology. In spite of these drawbacks, there is a trend to ignore these problems and even use GO terms to conduct searches of gene expression profiles (i.e. expression + GO) instead of more cautious approaches that just consider them as an independent source of validation (i.e. expression versus GO). Consequently, propagating the uncertainty and producing biased analysis of the required gene grouping hypotheses. We proposed a web tool, Onto-CC, as an automatic method specially suited for independent explanation/validation of gene grouping hypotheses (e.g. coexpressed genes) based on GO clusters (i.e. expression versus GO). Onto-CC approach reduces the uncertainty of the queries by identifying optimal conceptual clusters that combine terms from different ontologies simultaneously, as well as terms defined at different levels of specificity in the GO hierarchy. To do so, we implemented the EMO-CC methodology to find clusters in structural databases [GO Directed acyclic Graph (DAG) tree], inspired on Conceptual Clustering algorithms. This approach allows the management of optimal cluster sets as potential parallel hypotheses, guided by multiobjective/multimodal optimization techniques. Therefore, we can generate alternative and, still, optimal explanations of queries that can provide new insights for a given problem. Onto-CC has been successfully used to test different medical and biological hypotheses including the explanation and prediction of gene expression profiles resulting from the host response to injuries in the inflammatory problem. Onto-CC provides two versions: Ready2GO, a precalculated EMO-CC for several genomes and an Advanced Onto-CC for custom annotation files (http://gps-tools2.wustl.edu/onto-cc/index.html).

[1]  D.J. Cook,et al.  Structural mining of molecular biology data , 2001, IEEE Engineering in Medicine and Biology Magazine.

[2]  Purvesh Khatri,et al.  Ontological analysis of gene expression data: current tools, limitations, and open problems , 2005, Bioinform..

[3]  D. Cook,et al.  Graph-based hierarchical conceptual clustering , 2002 .

[4]  Erin Beck,et al.  The comprehensive microbial resource , 2000, Nucleic Acids Res..

[5]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[6]  Matthew Berriman,et al.  GeneDB: a resource for prokaryotic and eukaryotic organisms , 2004, Nucleic Acids Res..

[7]  Sankar K. Pal,et al.  Pattern Recognition: From Classical to Modern Approaches , 2001 .

[8]  J. Brune,et al.  Structural features in a brittle–ductile wax model of continental extension , 1997, nature.

[9]  James G. R. Gilbert,et al.  The vertebrate genome annotation (Vega) database , 2004, Nucleic Acids Res..

[10]  Tatiana A. Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[11]  Kimberly Van Auken,et al.  WormBase: new content and better access , 2006, Nucleic Acids Res..

[12]  T. Beißbarth,et al.  Interpreting experimental results using gene ontologies. , 2006, Methods in enzymology.

[13]  G. Church,et al.  Systematic determination of genetic network architecture , 1999, Nature Genetics.

[14]  M. Eisen,et al.  Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering , 2002, Genome Biology.

[15]  Madeline A. Crosby,et al.  FlyBase: genomes by the dozen , 2006, Nucleic Acids Res..

[16]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[17]  Joaquín Dopazo,et al.  The role of the environment in Parkinson's disease. , 1996, Nucleic Acids Res..

[18]  Henry Huang,et al.  Analysis of differentially-regulated genes within a regulatory network by GPS genome navigation , 2005, Bioinform..

[19]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[20]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[21]  David S. Wishart,et al.  MovieMaker: a web server for rapid rendering of protein motions and interactions , 2005, Nucleic Acids Res..

[22]  Dean Cheng,et al.  Pseudomonas aeruginosa Genome Database and PseudoCAP: facilitating community-based, continually updated, genome annotation , 2004, Nucleic Acids Res..

[23]  J. Liu,et al.  Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. , 2001, Nucleic acids research.

[24]  Eric M. Just,et al.  dictyBase, the model organism database for Dictyostelium discoideum , 2005, Nucleic Acids Res..

[25]  Jungwon Yoon,et al.  The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community , 2003, Nucleic Acids Res..

[26]  Judith A. Blake,et al.  The Mouse Genome Database (MGD): from genes to mice—a community resource for mouse biology , 2004, Nucleic Acids Res..

[27]  Oscar Cordón,et al.  A Multiobjective Evolutionary Conceptual Clustering Methodology for Gene Annotation Within Structural Databases: A Case of Study on the Gene Ontology Database , 2008, IEEE Transactions on Evolutionary Computation.

[28]  Monte Westerfield,et al.  The Zebrafish Information Network: the zebrafish model organism database , 2005, Nucleic Acids Res..

[29]  David L. Wheeler,et al.  GenBank , 2015, Nucleic Acids Res..

[30]  J. Collado-Vides,et al.  Identifying global regulators in transcriptional regulatory networks in bacteria. , 2003, Current opinion in microbiology.

[31]  Purvesh Khatri,et al.  Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate , 2003, Nucleic Acids Res..

[32]  Christophe G. Giraud-Carrier,et al.  A Note on the Utility of Incremental Learning , 2000, AI Commun..

[33]  John D. Storey,et al.  A network-based analysis of systemic inflammation in humans , 2005, Nature.

[34]  Steven Guan,et al.  An incremental approach to genetic-algorithms-based classification , 2005, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[35]  Marek S. Skrzypek,et al.  The Candida Genome Database (CGD), a community resource for Candida albicans gene and protein information , 2004, Nucleic Acids Res..

[36]  Mary Shimoyama,et al.  The Rat Genome Database, update 2007—Easing the path from disease to data and back again , 2006, Nucleic Acids Res..

[37]  Claudine Médigue,et al.  MICheck: a web tool for fast checking of syntactic annotations of bacterial genomes , 2005, Nucleic Acids Res..

[38]  C. Ball,et al.  Genetic and physical maps of Saccharomyces cerevisiae. , 1997, Nature.

[39]  Monte Westerfield,et al.  The Zebrafish Information Network (ZFIN): the zebrafish model organism database , 2003, Nucleic Acids Res..