WORMHOLE: Novel Least Diverged Ortholog Prediction through Machine Learning

The rapid advancement of technology in genomics and targeted genetic manipulation has made comparative biology an increasingly prominent strategy for modeling human disease processes. Predicting orthology relationships between species is a vital component of comparative biology. Dozens of strategies for predicting orthologs have been developed, with progressively increasing accuracy, using combinations of gene and protein sequence, phylogenetic history, and functional interaction. A relatively new class of orthology prediction strategies combines aspects of multiple methods into meta-tools, resulting in improved prediction performance. Here we present WORMHOLE, a novel ortholog prediction meta-tool that applies machine learning to integrate 17 distinct ortholog prediction algorithms and identify novel least diverged orthologs (LDOs) among six eukaryotic species: humans, mice, zebrafish, fruit flies, nematodes, and budding yeast. Machine learning allows WORMHOLE to incorporate predictions from a wide spectrum of strategies and form aggregate, high-confidence predictions of LDOs. In this study we demonstrate the performance of WORMHOLE across each combination of query and target species. We show that WORMHOLE is particularly adept at improving LDO prediction performance between distantly related species, expanding the pool of LDOs while maintaining low evolutionary distance and a high level of functional relatedness between the genes in each LDO pair. We present extensive validation, including cross-validated prediction of PANTHER LDOs and evaluation of evolutionary divergence and functional similarity, and discuss future applications of machine learning in ortholog prediction. The WORMHOLE web tool is available at http://wormhole.jax.org/.
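
To make the meta-tool idea concrete, the following is a minimal sketch, not the WORMHOLE implementation. It assumes each candidate gene pair is described by 17 binary votes (one per constituent algorithm) and labeled by whether it is a reference LDO, and it fits a plain logistic regression to those votes. The choice of learner, the synthetic vote rates, and all function names are illustrative assumptions; the abstract does not specify them.

```python
import math
import random

N_ALGORITHMS = 17  # number of constituent predictors named in the abstract


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_pairs(n_pairs, seed=0):
    """Synthetic data only: each row is 17 binary votes plus a 0/1 label."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_pairs):
        label = i % 2  # alternate reference-LDO and non-LDO pairs
        p_vote = 0.7 if label else 0.1  # assumed vote rates, not measured
        votes = [1 if rng.random() < p_vote else 0 for _ in range(N_ALGORITHMS)]
        rows.append((votes, label))
    return rows


def train(rows, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient ascent on the log-likelihood."""
    w = [0.0] * N_ALGORITHMS
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * N_ALGORITHMS
        grad_b = 0.0
        for votes, label in rows:
            err = label - sigmoid(b + sum(wi * v for wi, v in zip(w, votes)))
            grad_b += err
            for j, v in enumerate(votes):
                grad_w[j] += err * v
        b += lr * grad_b / len(rows)
        w = [wi + lr * g / len(rows) for wi, g in zip(w, grad_w)]
    return w, b


def confidence(model, votes):
    """Aggregate confidence that a gene pair is an LDO, given its 17 votes."""
    w, b = model
    return sigmoid(b + sum(wi * v for wi, v in zip(w, votes)))


model = train(make_toy_pairs(200))
unanimous = confidence(model, [1] * N_ALGORITHMS)   # every algorithm agrees
no_support = confidence(model, [0] * N_ALGORITHMS)  # no algorithm agrees
```

In this toy setting a pair supported by every algorithm receives a higher aggregate confidence than a pair supported by none. A learned weighting of this kind, rather than a fixed vote count, is what lets a meta-tool favor the more reliable constituent methods.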
