Inferring Indel Parameters using a Simulation-based Approach

In this study, we present a novel simulation-based methodology for inferring indel parameters from multiple sequence alignments (MSAs). Our algorithm searches for the set of evolutionary parameters describing indel dynamics that best fits a given input MSA. In each step of the search, we use parametric bootstraps and the Mahalanobis distance to estimate how well a proposed set of parameters fits the input data. Using simulations, we demonstrate that our methodology can accurately infer the indel parameters for a large variety of plausible settings. Moreover, using our methodology, we show that indel parameters vary substantially among three genomic data sets: mammals, bacteria, and retroviruses. Finally, we demonstrate how our methodology can be used to simulate MSAs based on indel parameters inferred from real data sets.
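To make the fitting step concrete, the following is a minimal sketch, not the authors' implementation, of how a Mahalanobis distance between observed summary statistics and parametric-bootstrap replicates could score a candidate parameter value. The summary statistics (gap count and mean gap length), the function `toy_simulator`, the grid of candidate rates, and all numeric constants are assumptions introduced purely for illustration; a real analysis would replace the toy simulator with an indel-aware sequence simulator run along the phylogeny.

```python
import numpy as np


def mahalanobis_distance(observed, simulated):
    """Mahalanobis distance from an observed summary-statistic vector
    to the cloud of simulated replicates (one replicate per row)."""
    simulated = np.asarray(simulated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    mean = simulated.mean(axis=0)
    cov = np.atleast_2d(np.cov(simulated, rowvar=False))
    # The pseudo-inverse guards against a singular covariance matrix.
    inv_cov = np.linalg.pinv(cov)
    diff = observed - mean
    return float(np.sqrt(diff @ inv_cov @ diff))


def toy_simulator(indel_rate, n_replicates, rng):
    """Hypothetical stand-in for an MSA simulator. Returns one row per
    replicate with two assumed statistics: gap count, mean gap length."""
    gap_counts = rng.poisson(lam=100.0 * indel_rate, size=n_replicates)
    mean_gap_len = rng.normal(loc=3.0, scale=0.5, size=n_replicates)
    return np.column_stack([gap_counts, mean_gap_len])


def best_fitting_rate(observed, candidate_rates, n_replicates=200, seed=0):
    """Score every candidate rate by its Mahalanobis distance to the
    observed statistics and return the closest one with all distances."""
    rng = np.random.default_rng(seed)
    distances = [
        mahalanobis_distance(observed, toy_simulator(rate, n_replicates, rng))
        for rate in candidate_rates
    ]
    return candidate_rates[int(np.argmin(distances))], distances


if __name__ == "__main__":
    # Assumed observed statistics: 20 gaps with a mean length of 3.0.
    rate, dists = best_fitting_rate([20.0, 3.0], [0.05, 0.2, 0.5])
    print("best candidate rate:", rate)
    print("distances:", [round(d, 2) for d in dists])
```

The sketch uses an exhaustive grid only for brevity; the abstract describes a search over parameter sets without specifying the search strategy here. Scaling by the covariance of the simulated statistics is what lets statistics with different units and variances be combined into a single measure of fit.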
