The ortholog conjecture revisited: the value of orthologs and paralogs in function prediction

The computational prediction of gene function is a key step in making full use of newly sequenced genomes. Function is generally predicted by transferring annotations from homologous genes or proteins for which experimental evidence exists. The “ortholog conjecture” proposes that orthologous genes should be preferred when making such predictions, as they evolve functions more slowly than paralogous genes. Previous research has provided little support for the ortholog conjecture, though the incomplete nature of the data cast doubt on the conclusions. Here we use experimental annotations from over 40,000 proteins, drawn from over 80,000 publications, to revisit the ortholog conjecture in two pairs of species: (i) Homo sapiens and Mus musculus and (ii) Saccharomyces cerevisiae and Schizosaccharomyces pombe. By making a distinction between questions about the evolution of function versus questions about the prediction of function, we find strong evidence against the ortholog conjecture in the context of function prediction, though questions about the evolution of function remain difficult to address. In both pairs of species, we quantify the amount of data that must be ignored if paralogs are discarded, as well as the resulting loss in prediction accuracy. Taken as a whole, our results support the view that the types of homologs used for function transfer are largely irrelevant to the task of function prediction. Aiming to maximize the amount of data used for this task, regardless of whether it comes from orthologs or paralogs, is most likely to lead to higher prediction accuracy.
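The comparison described above (transferring annotations from orthologs only versus from all homologs, then measuring what is lost) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy homolog list, the identity-based scoring rule, the 0.5 threshold, and the function names `transfer`, `predict`, and `f1` are hypothetical and are not the study's data or method. The invented numbers show only the mechanics of the comparison and say nothing about real proteins.

```python
from typing import Dict, List, Set, Tuple

# (relation to the target protein, sequence identity, experimentally annotated terms)
Homolog = Tuple[str, float, Set[str]]

# Hypothetical toy data, invented for illustration only.
HOMOLOGS: List[Homolog] = [
    ("ortholog", 0.90, {"kinase activity"}),
    ("paralog", 0.70, {"kinase activity", "ATP binding"}),
    ("paralog", 0.40, {"DNA binding"}),
]
TRUTH: Set[str] = {"kinase activity", "ATP binding"}  # assumed true functions of the target
THRESHOLD = 0.5  # assumed score cutoff for making a prediction


def transfer(homologs: List[Homolog], allowed: Set[str]) -> Dict[str, float]:
    """Score each term by the highest identity among allowed homologs that carry it."""
    scores: Dict[str, float] = {}
    for relation, identity, terms in homologs:
        if relation not in allowed:
            continue  # this homolog type is discarded
        for term in terms:
            scores[term] = max(scores.get(term, 0.0), identity)
    return scores


def predict(allowed: Set[str]) -> Set[str]:
    """Return the terms whose transferred score reaches the threshold."""
    return {t for t, s in transfer(HOMOLOGS, allowed).items() if s >= THRESHOLD}


def f1(predicted: Set[str], truth: Set[str]) -> float:
    """Harmonic mean of precision and recall over sets of terms."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


orthologs_only = predict({"ortholog"})
all_homologs = predict({"ortholog", "paralog"})

# Share of annotated homologs ignored when paralogs are discarded.
discarded = sum(1 for rel, _, _ in HOMOLOGS if rel == "paralog") / len(HOMOLOGS)

print("orthologs only:", sorted(orthologs_only), f1(orthologs_only, TRUTH))
print("all homologs:  ", sorted(all_homologs), f1(all_homologs, TRUTH))
print("fraction of homologs discarded:", discarded)
```

In this sketch the ortholog-only prediction misses a term that only a paralog carries, which is the kind of data loss the abstract refers to. Whether such a loss occurs in practice depends entirely on the real annotations.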
