NoisyGOA: Noisy GO annotations prediction using taxonomic and semantic similarity

The Gene Ontology (GO) provides GO annotations (GOA) that associate gene products with GO terms summarizing their cellular, molecular and functional aspects in the context of biological pathways. The GO Consortium (GOC) applies various quality assurance measures to ensure the correctness of annotations. Because of resource limitations, only a small portion of annotations are manually added or checked by GO curators, and a large portion of the available annotations are computationally inferred. While computationally inferred annotations provide greater coverage of known genes, they may also introduce annotation errors (noise) that could mislead the interpretation of gene functions and their roles in cellular and biological processes. In this paper, we investigate how to identify noisy annotations, a rarely addressed problem, and propose a novel approach called NoisyGOA. NoisyGOA first measures the taxonomic similarity between ontological terms using the GO hierarchy, and the semantic similarity between genes. Next, it leverages both similarities to predict noisy annotations. We compare NoisyGOA with alternative methods on identifying noisy annotations under different simulated cases of noisy annotations, and on archived GO annotations. NoisyGOA achieved higher accuracy than the alternative methods. These results demonstrate that both taxonomic similarity and semantic similarity contribute to the identification of noisy annotations. Our study shows that annotation errors are predictable and that removing noisy annotations improves the performance of gene function prediction. This study can prompt the community to study methods for removing inaccurate annotations, a critical step for annotating gene and pathway functions.
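
The two-step idea described above (term-level taxonomic similarity plus gene-level semantic similarity, combined to flag suspicious annotations) can be illustrated with a minimal sketch. This is not the NoisyGOA algorithm itself, whose formulas are not given in this abstract. The toy ontology, the gene names, the Jaccard-based similarities, the neighbour count `k`, and the function names are all assumptions chosen for illustration.

```python
from typing import Dict, Set

# Hypothetical toy ontology (not real GO terms): term -> set of parent terms.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
    "kinase": {"catalysis"},
}

# Hypothetical annotations: gene -> set of annotated terms.
# "kinase" on g1 plays the role of a noisy annotation.
ANNOTATIONS: Dict[str, Set[str]] = {
    "g1": {"dna_binding", "kinase"},
    "g2": {"dna_binding"},
    "g3": {"dna_binding", "rna_binding"},
    "g4": {"rna_binding"},
}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_similarity(a: str, b: str) -> float:
    """Assumed taxonomic similarity: Jaccard overlap of ancestor sets."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def gene_similarity(x: Set[str], y: Set[str]) -> float:
    """Assumed semantic similarity: Jaccard of ancestor-propagated term sets."""
    prop_x = set().union(*(ancestors(t) for t in x))
    prop_y = set().union(*(ancestors(t) for t in y))
    union = prop_x | prop_y
    return len(prop_x & prop_y) / len(union) if union else 0.0


def annotation_support(gene: str, annotations: Dict[str, Set[str]],
                       k: int = 2) -> Dict[str, float]:
    """Score each annotation of `gene` by how well its k most similar
    genes support it; a low score marks a candidate noisy annotation."""
    scored = [(gene_similarity(annotations[gene], annotations[g]), g)
              for g in annotations if g != gene]
    neighbours = sorted(scored, reverse=True)[:k]
    support = {}
    for term in annotations[gene]:
        support[term] = sum(
            weight * max((term_similarity(term, s) for s in annotations[n]),
                         default=0.0)
            for weight, n in neighbours
        )
    return support


if __name__ == "__main__":
    scores = annotation_support("g1", ANNOTATIONS, k=2)
    for term, score in sorted(scores.items(), key=lambda kv: kv[1]):
        print(f"{term}: {score:.2f}")
```

In this toy setting the annotation with the lowest support is the one that neither the gene's closest neighbours nor their related terms corroborate. A real pipeline would replace the Jaccard measures with the similarity definitions from the paper and would read annotations from GO annotation files.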
