Probabilistic approaches to alignment with tandem repeats

Background: Short tandem repeats are ubiquitous in genomic sequences and, because of their complex evolutionary history, pose a challenge for sequence alignment tools.

Results: To better account for tandem repeats in pairwise sequence alignments, we propose a simple, tractable pair hidden Markov model (pair HMM) that explicitly models their presence. Using the framework of gain functions, we design several optimization criteria for decoding this model and describe the resulting decoding algorithms, ranging from the traditional Viterbi and posterior decoding to block-based decoding algorithms tailored to our model. We compare the accuracy of the individual decoding algorithms on simulated and real data and find that our approach is superior to the classical three-state pair HMM.

Conclusions: Our study illustrates the versatility of pair hidden Markov models, coupled with appropriate decoding criteria, as a modeling tool for capturing complex sequence features.
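For orientation, the classical three-state pair HMM used as the baseline above can be decoded with the Viterbi algorithm roughly as follows. This is a minimal sketch, not the tandem-repeat model or the block-based decoders proposed here. The function name `viterbi_pair_hmm`, the parameter names (`delta` for gap opening, `epsilon` for gap extension, `p_match`, `q`) and all default values are illustrative assumptions, and the sketch omits an explicit end state.

```python
import math

NEG_INF = float("-inf")

def viterbi_pair_hmm(x, y, delta=0.1, epsilon=0.3, p_match=0.2, q=0.25):
    """Most probable alignment under a three-state pair HMM (hypothetical parameters).

    States: M emits an aligned pair, X emits a symbol of x against a gap,
    Y emits a symbol of y against a gap. Returns (log_prob, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)
    log = math.log
    # Assumption: the 12 mismatching pairs share the remaining emission mass.
    p_mismatch = (1.0 - 4.0 * p_match) / 12.0
    # trans[prev][cur]: log transition probabilities; X<->Y is disallowed.
    trans = {
        "M": {"M": log(1 - 2 * delta), "X": log(delta), "Y": log(delta)},
        "X": {"M": log(1 - epsilon), "X": log(epsilon), "Y": NEG_INF},
        "Y": {"M": log(1 - epsilon), "X": NEG_INF, "Y": log(epsilon)},
    }
    step = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}
    states = "MXY"
    # V[s][i][j]: best log probability of emitting x[:i], y[:j] and ending in s.
    V = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in states}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in states}
    V["M"][0][0] = 0.0  # the start is treated as state M

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            for s in states:
                pi, pj = i - step[s][0], j - step[s][1]
                if pi < 0 or pj < 0:
                    continue
                if s == "M":
                    emit = log(p_match if x[pi] == y[pj] else p_mismatch)
                else:
                    emit = log(q)
                best, best_prev = NEG_INF, None
                for p in states:
                    cand = V[p][pi][pj] + trans[p][s]
                    if cand > best:
                        best, best_prev = cand, p
                if best_prev is not None:
                    V[s][i][j] = best + emit
                    back[s][i][j] = best_prev

    # Trace back from the best final state to recover the alignment.
    s = max(states, key=lambda t: V[t][n][m])
    score = V[s][n][m]
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        prev = back[s][i][j]
        di, dj = step[s]
        ax.append(x[i - 1] if di else "-")
        ay.append(y[j - 1] if dj else "-")
        i, j = i - di, j - dj
        s = prev
    return score, "".join(reversed(ax)), "".join(reversed(ay))

if __name__ == "__main__":
    score, ax, ay = viterbi_pair_hmm("ACGT", "ACT")
    print(ax)
    print(ay)
```

Viterbi returns the single most probable state path. Posterior decoding and the gain-function criteria mentioned above instead optimize an expected gain over all alignments, which this sketch does not attempt.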
