Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events

Protein annotation errors can have significant consequences in fields ranging from protein structure and function prediction to biomedical research, drug discovery, and biotechnology. By comparing the domains of different proteins, scientists can identify common domains, classify proteins based on their domain architecture, and highlight proteins that have evolved differently in one or more species or clades. However, genome-wide identification of protein domain architectures involves a complex, error-prone pipeline that includes genome sequencing, prediction of gene exon/intron structures, and inference of protein sequences and domain annotations. Here we developed an automated fact-checking approach to distinguish true domain loss/gain events from false events caused by errors that occur during the annotation process. Using genome-wide ortholog sets and taking advantage of the high-quality human and Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. cerevisiae fungi (NSF) as annotated in the UniProt and InterPro databases. Our approach allowed us to quantify the impact of errors on estimates of protein domain gains and losses, and we show that domain losses are over-estimated ten-fold and three-fold in the NHP and NSF proteins, respectively. This is in line with previous studies of gene-level losses, where issues with genome sequencing or gene annotation led to genes being falsely inferred as absent. In addition, we show that inconsistent protein domain annotations are a major factor contributing to the false events. For the first time, to our knowledge, we show that domain gains are also over-estimated, by three-fold and two-fold in NHP and NSF proteins, respectively. Based on our more accurate estimates, we infer that true domain losses and gains in NHP with respect to humans are observed at similar rates, while domain gains in the more divergent NSF are observed twice as frequently as domain losses with respect to S. cerevisiae. This study highlights the need to critically examine the scientific validity of protein annotations, and represents a significant step toward scalable computational fact-checking methods that may one day mitigate the propagation of wrong information in protein databases.
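
To make the notions of apparent domain gain and loss and of fold over-estimation concrete, the following minimal Python sketch compares the domain content of a reference protein with that of an ortholog and computes an over-estimation factor. It is an illustration only and not the pipeline used in this study: the function names, the set-based comparison, the domain labels (DOM_A to DOM_D), and the event counts are all hypothetical.

```python
from typing import Iterable


def domain_events(reference: Iterable[str],
                  ortholog: Iterable[str]) -> tuple[list[str], list[str]]:
    """Return (apparent losses, apparent gains) in the ortholog
    relative to the reference protein.

    Assumption: domains are compared as unordered sets of labels,
    ignoring domain order and copy number.
    """
    ref, orth = set(reference), set(ortholog)
    losses = sorted(ref - orth)   # in the reference, missing in the ortholog
    gains = sorted(orth - ref)    # in the ortholog, missing in the reference
    return losses, gains


def fold_overestimate(apparent_events: int, confirmed_events: int) -> float:
    """Ratio of apparent events to events that survive error checking."""
    if confirmed_events <= 0:
        raise ValueError("confirmed_events must be positive")
    return apparent_events / confirmed_events


# Hypothetical domain labels for a reference protein and one ortholog.
reference_domains = ["DOM_A", "DOM_B", "DOM_C"]
ortholog_domains = ["DOM_A", "DOM_C", "DOM_D"]

losses, gains = domain_events(reference_domains, ortholog_domains)
print(losses, gains)

# Hypothetical counts: 100 apparent losses, 10 confirmed after checking.
print(fold_overestimate(100, 10))
```

In this toy comparison, every apparent event would still have to be checked against possible sequencing, gene prediction, or annotation errors before being counted as a true gain or loss.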
