A Penalized-Likelihood Method to Estimate the Distribution of Selection Coefficients from Phylogenetic Data

We develop a maximum penalized-likelihood (MPL) method to estimate the fitnesses of amino acids and the distribution of selection coefficients (S = 2Ns) in protein-coding genes from phylogenetic data, improving on a previous maximum-likelihood method. Various penalty functions are used to penalize extreme estimates of the fitnesses, thereby correcting the overfitting seen with the previous method. Using a combination of computer simulation and real data analysis, we evaluate the effect of the various penalties on the estimation of the fitnesses and of the distribution of S. We show that the new method regularizes the estimates of the fitnesses for small, relatively uninformative data sets, yet it can still recover a large proportion of deleterious mutations when they are present in simulated data. Computer simulations indicate that the distribution of S is estimated more accurately as the number of taxa in the phylogeny or the level of sequence divergence increases. Furthermore, the strength of the penalty can be varied to study how informative a particular data set is about the distribution of S. We analyze three protein-coding data sets (the chloroplast rubisco protein, mammalian mitochondrial proteins, and an influenza virus polymerase) and show that the new method recovers a large proportion of deleterious mutations in these data, even under strong penalties, confirming that the distribution of S is bimodal in these real data. We recommend the new MPL approach for estimating the distribution of S in species phylogenies of protein-coding genes.
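The regularizing effect of a penalty on fitness estimates can be illustrated with a deliberately simplified sketch. It is not the phylogenetic likelihood used in this work: it assumes a single site with independent amino acid counts, equilibrium frequencies proportional to exp(F_i) (as would hold under uniform mutation), and a quadratic penalty on the scaled fitnesses F_i. The function names, the penalty form, and the parameter `lam` are hypothetical choices made only for illustration.

```python
import math

def penalized_loglik(F, counts, lam):
    """Toy penalized log-likelihood for one site (illustrative assumption).

    F      : scaled fitnesses, one per amino acid state
    counts : observed counts of each state at the site
    lam    : penalty strength (hypothetical name); lam = 0 gives plain ML
    """
    log_z = math.log(sum(math.exp(f) for f in F))
    # Multinomial log-likelihood with frequencies pi_i = exp(F_i) / sum_j exp(F_j)
    loglik = sum(n * (f - log_z) for n, f in zip(counts, F))
    # Quadratic penalty discourages extreme fitness estimates
    penalty = 0.5 * lam * sum(f * f for f in F)
    return loglik - penalty

def fit(counts, lam, steps=5000, lr=0.01):
    """Maximize the toy penalized log-likelihood by plain gradient ascent."""
    F = [0.0] * len(counts)
    total = sum(counts)
    for _ in range(steps):
        z = sum(math.exp(f) for f in F)
        # d/dF_i = n_i - N * pi_i - lam * F_i
        grad = [n - total * math.exp(f) / z - lam * f
                for n, f in zip(counts, F)]
        F = [f + lr * g for f, g in zip(F, grad)]
    return F

# Two states observed, two never observed at this site
counts = [8, 2, 0, 0]
F_weak = fit(counts, lam=0.01)
F_strong = fit(counts, lam=1.0)
print("weak penalty:  ", [round(f, 2) for f in F_weak])
print("strong penalty:", [round(f, 2) for f in F_strong])
```

In this toy setting, unobserved states are pushed toward very low fitness when the penalty is weak, while a stronger penalty pulls all estimates toward zero. Varying `lam` in this way loosely mirrors the idea of varying the penalty strength to gauge how informative a data set is.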
