Rock, Paper, Scissors: Harnessing Complementarity in Ortholog Detection Methods Improves Comparative Genomic Inference

Ortholog detection (OD) is a lynchpin of most statistical methods in comparative genomics. This task involves accurately identifying genes across species that descend from a common ancestral sequence. OD methods comprise a wide variety of approaches, each with its own benefits and costs under a variety of evolutionary and practical scenarios. In this article, we examine the proteomes of ten mammals using four methodologically distinct, rigorously filtered OD methods. In head-to-head comparisons, we find that these algorithms significantly outperform one another for 38–45% of the genes analyzed. We leverage this high complementarity through the development of MOSAIC, or Multiple Orthologous Sequence Analysis and Integration by Cluster optimization, the first tool for integrating methodologically diverse OD methods. Relative to the four methods examined, MOSAIC more than quintuples the number of alignments for which all species are present while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality. Further, this improvement in alignment quality yields more confidently aligned sites and higher levels of overall conservation, while simultaneously detecting up to 180% more positively selected sites. We close by highlighting MOSAIC-specific positively selected sites near the active site of TPSAB1, an enzyme linked to asthma, heart disease, and irritable bowel disease. MOSAIC alignments, source code, and full documentation are available at http://pythonhosted.org/bio-MOSAIC.
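To make the idea of integrating complementary OD methods concrete, the following is a minimal illustrative sketch, not the MOSAIC implementation. It assumes each method proposes at most one candidate ortholog per species, that candidates are already aligned to the reference at equal length, and that plain sequence identity to the reference is the selection criterion. The names `percent_identity`, `integrate`, `method_a`, and `method_b` are hypothetical.

```python
from typing import Dict, Optional, Tuple


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical, non-gap positions in two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be pre-aligned to the same non-zero length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return matches / len(a)


def integrate(
    reference: str,
    proposals: Dict[str, Dict[str, Optional[str]]],
    min_identity: float = 0.0,
) -> Dict[str, Tuple[str, float, str]]:
    """For each species, keep the candidate most similar to the reference.

    proposals maps method name -> species -> candidate sequence (or None
    when that method proposes no ortholog for the species).
    Returns species -> (winning method, identity score, sequence).
    """
    chosen: Dict[str, Tuple[str, float, str]] = {}
    for method, by_species in proposals.items():
        for species, seq in by_species.items():
            if seq is None:
                continue  # this method has no proposal for this species
            score = percent_identity(reference, seq)
            if score < min_identity:
                continue  # assumed quality filter; threshold is illustrative
            if species not in chosen or score > chosen[species][1]:
                chosen[species] = (method, score, seq)
    return chosen


# Toy data (made up for illustration): two methods, two species.
reference = "MKTAYIAK"
proposals = {
    "method_a": {"mouse": "MKTAYIAK", "dog": None},
    "method_b": {"mouse": "MKTGYLAK", "dog": "MKTAYLAK"},
}
result = integrate(reference, proposals)
```

In this toy case `method_a` supplies the mouse sequence and `method_b` fills in the dog sequence that `method_a` lacks, so the combined alignment covers more species than either method alone. This illustrates the kind of complementarity described above, under the stated simplifying assumptions.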
