GOcats: A tool for categorizing Gene Ontology into subgraphs of user-defined concepts

Gene Ontology is used extensively in scientific knowledgebases and repositories to organize a wealth of biological information. However, interpreting annotations derived from differential gene lists is often difficult without manually sorting them into higher-order categories. To address this issue, we present GOcats, a novel tool that organizes the Gene Ontology (GO) into subgraphs representing user-defined concepts, while ensuring that all appropriate relations are congruent with respect to scoping semantics. We tested GOcats' performance by using subcellular location categories to mine annotations from GO-utilizing knowledgebases and evaluated the accuracy of the mined annotations against immunohistochemistry datasets in the Human Protein Atlas (HPA). GOcats outperformed term categorizations generated from UniProt's controlled vocabulary and from GO slims via OWLTools' Map2Slim in its ability to mimic human-categorized GO term sets. Unlike the other methods, GOcats relies only on an input of basic keywords from the user (e.g., a biologist), not on a manually compiled or static set of top-level GO terms. Additionally, by identifying and properly defining relations with respect to semantic scope, GOcats can utilize the traditionally problematic has_part relation without producing erroneous term mappings. We applied GOcats to compare HPA-sourced knowledgebase annotations with experimentally derived annotations provided by HPA directly. In this comparison, GOcats improved correspondence between the annotation sources by adjusting semantic granularity. GOcats enables the creation of custom, GO slim-like filters that map fine-grained gene annotations from gene annotation files to general subcellular compartments without the need to hand-select a set of GO terms for categorization. Moreover, GOcats can customize the level of semantic specificity for annotation categories.
Furthermore, GOcats enables safe and more comprehensive use of go-core with respect to semantic scoping, allowing more complete use of the information available in GO. Together, these improvements can benefit a variety of GO knowledgebase data-mining use cases, as well as knowledgebase curation and quality control.
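
The keyword-driven categorization described above can be illustrated with a minimal Python sketch. It is not the GOcats implementation or its API: the toy term table, the edge list, and the functions `scoped_children` and `categorize` are hypothetical, and the rule of reading has_part in the whole-to-part direction is an assumption about what "congruent with respect to scoping semantics" means in practice.

```python
from collections import defaultdict

# Toy ontology for illustration only; IDs and names are invented, not real GO terms.
TERMS = {
    "T1": "nucleus",
    "T2": "envelope",
    "T3": "nucleolus",
    "T4": "mitochondrion",
    "T5": "matrix",
    "T6": "pore complex",
}

# Edges are (source, relation, target), read as "source relation target".
EDGES = [
    ("T2", "part_of", "T1"),
    ("T3", "is_a", "T1"),       # artificial is_a edge, just to exercise the relation
    ("T5", "part_of", "T4"),
    ("T2", "has_part", "T6"),   # whole -> part: the source is the broader term
]


def scoped_children(edges):
    """Map each broader-scope term to the terms directly narrower than it."""
    children = defaultdict(set)
    for source, relation, target in edges:
        if relation in ("is_a", "part_of"):
            # The source is narrower than the target.
            children[target].add(source)
        elif relation == "has_part":
            # Assumption: the target (the part) is narrower than the source,
            # so this edge is followed in the opposite direction.
            children[source].add(target)
    return children


def categorize(terms, edges, keywords):
    """Return IDs of terms whose name matches a keyword, plus all narrower terms."""
    children = scoped_children(edges)
    lowered = [k.lower() for k in keywords]
    seeds = {tid for tid, name in terms.items()
             if any(k in name.lower() for k in lowered)}
    members = set(seeds)
    stack = list(seeds)
    while stack:
        for child in children[stack.pop()]:
            if child not in members:
                members.add(child)
                stack.append(child)
    return members


nucleus_category = categorize(TERMS, EDGES, ["nucleus"])
mito_category = categorize(TERMS, EDGES, ["mitochondri"])
```

In this sketch, `T6` lands in the nucleus category only because the has_part edge is traversed from whole to part. Treating it like is_a or part_of would instead map `T2` under `T6`, which is the kind of erroneous mapping the abstract refers to.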
