Categorizer: a tool to categorize genes into user-defined biological groups based on semantic similarity

Background: Commonalities between large sets of genes obtained from high-throughput experiments are often identified by searching for enrichments of genes with the same Gene Ontology (GO) annotations. The GO analysis tools used for these enrichment analyses assume that GO terms are independent and that the semantic distances between all parent–child terms are identical, which is not true in a biological sense. In addition, these tools output lists of GO terms that are often redundant or overly specific, and therefore difficult to interpret in the context of the biological question investigated by the user. There is thus a demand for a robust and reliable method for gene categorization and enrichment analysis.

Results: We have developed Categorizer, a tool that classifies genes into user-defined groups (categories) and calculates p-values for the enrichment of the categories. Categorizer identifies the biologically best-fit category for each gene by taking advantage of a specialized semantic similarity measure for GO terms. We demonstrate that, for genetic modifiers of Huntington’s disease, Categorizer provides improved categorization and enrichment results compared to a classical GO Slim-based approach or to categorizations using other semantic similarity measures.

Conclusion: Categorizer enables more accurate categorizations of genes than currently available methods. This new tool will help experimental and computational biologists analyze genomic and proteomic data according to their specific needs in a more reliable manner.
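The two steps named in the abstract, assigning each gene to a best-fit category and computing an enrichment p-value for each category, can be illustrated with a minimal Python sketch. This is not Categorizer's implementation: the abstract does not specify its similarity measure or its statistical test, so the sketch assumes a caller-supplied similarity function `sim`, a best-match-maximum scoring rule, and a one-sided hypergeometric test. All function names and the toy term labels are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    N: genes in the background, K: background genes in the category,
    n: genes in the study set, k: study-set genes in the category.
    (Assumed test; the abstract does not name the one Categorizer uses.)
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def assign_best_category(gene_terms, category_terms, sim):
    """Return (category, score) for the category whose terms best match the gene.

    gene_terms: GO terms annotated to one gene.
    category_terms: dict mapping category name -> list of GO terms defining it.
    sim: any term-to-term similarity function returning a float (assumption).
    """
    scores = {}
    for category, terms in category_terms.items():
        # Score a category by its single best term-to-term match.
        scores[category] = max(
            (sim(g, c) for g in gene_terms for c in terms), default=0.0
        )
    best = max(scores, key=scores.get)
    return best, scores[best]


def toy_sim(a, b):
    # Placeholder similarity: 1.0 for identical terms, 0.0 otherwise.
    return 1.0 if a == b else 0.0


categories = {"chaperones": ["term_a"], "transcription": ["term_b"]}
best, score = assign_best_category(["term_a"], categories, toy_sim)
p_value = hypergeom_enrichment_p(k=1, n=1, K=1, N=2)
```

In practice `toy_sim` would be replaced by a GO semantic similarity measure, and the counts passed to `hypergeom_enrichment_p` would come from tallying the category assignments over the study set and the background.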
