Parameterizing sequence alignment with an explicit evolutionary model

Background: Inference of sequence homology is inherently an evolutionary question, dependent upon evolutionary divergence. However, the insertion and deletion penalties in the most widely used methods for inferring homology by sequence alignment, including BLAST and profile hidden Markov models (profile HMMs), are not based on any explicitly time-dependent evolutionary model. Using one fixed score system (BLOSUM62 with some gap open/extend costs, for example) corresponds to making an unrealistic assumption that all sequence relationships have diverged by the same time. Adoption of explicit time-dependent evolutionary models for scoring insertions and deletions in sequence alignments has been hindered by algorithmic complexity and technical difficulty.

Results: We identify and implement several probabilistic evolutionary models compatible with the affine-cost insertion/deletion model used in standard pairwise sequence alignment. Assuming an affine gap cost imposes important restrictions on the realism of the evolutionary models compatible with it, as single insertion events with geometrically distributed lengths do not result in geometrically distributed insert lengths at finite times. Nevertheless, we identify one evolutionary model compatible with symmetric pair HMMs that are the basis for Smith-Waterman pairwise alignment, and two evolutionary models compatible with standard profile-based alignment.

We test different aspects of the performance of these “optimized branch length” models, including alignment accuracy and homology coverage (discrimination of residues in a homologous region from nonhomologous flanking residues). We test on benchmarks of both global homologies (full length sequence homologs) and local homologies (homologous subsequences embedded in nonhomologous sequence).

Conclusions: Contrary to our expectations, we find that for global homologies a single long branch parameterization suffices both for distant and close homologous relationships.
In contrast, we do see an advantage in using explicit evolutionary models for local homologies. Optimal branch parameterization reduces a known artifact called “homologous overextension”, in which local alignments erroneously extend through flanking nonhomologous residues.
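
The link between an affine gap cost and geometrically distributed gap lengths, mentioned in the Results, can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a generic gap state that is entered with probability `delta` and re-entered with probability `epsilon`; both names and the numeric values are hypothetical, and match-state and null-model terms are ignored.

```python
import math


def gap_log_prob(k, delta, epsilon):
    """Log-probability of a gap of length k >= 1 under a geometric length model.

    delta   : probability of opening a gap (hypothetical parameter name)
    epsilon : probability of extending an open gap by one residue
    """
    if k < 1:
        raise ValueError("gap length must be at least 1")
    # Open once, extend (k - 1) times, then leave the gap state.
    return math.log(delta) + (k - 1) * math.log(epsilon) + math.log(1.0 - epsilon)


def affine_params(delta, epsilon):
    """Convert the two probabilities into gap-open and gap-extend costs."""
    gap_open = -(math.log(delta) + math.log(1.0 - epsilon))
    gap_extend = -math.log(epsilon)
    return gap_open, gap_extend


def affine_gap_cost(k, gap_open, gap_extend):
    """Affine cost of a gap of length k: one open plus (k - 1) extensions."""
    return gap_open + (k - 1) * gap_extend


if __name__ == "__main__":
    delta, epsilon = 0.05, 0.4  # arbitrary illustrative values
    gap_open, gap_extend = affine_params(delta, epsilon)
    for k in range(1, 6):
        cost = affine_gap_cost(k, gap_open, gap_extend)
        print(k, round(cost, 4), round(-gap_log_prob(k, delta, epsilon), 4))
```

For every length the affine cost equals the negative log-probability, so a fixed pair of gap costs fixes `delta` and `epsilon`. That is the sense in which a single score system implies a single divergence.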
