Boosting Alignment Accuracy by Adaptive Local Realignment

While mutation rates can vary markedly across the residues of a protein, multiple sequence alignment tools typically use the same values for their scoring-function parameters along the protein’s entire length. We present a new approach, called adaptive local realignment, that in contrast automatically adapts to the diversity of mutation rates along protein sequences. It builds upon a recent technique known as parameter advising, which finds global parameter settings for aligners, and extends it to adaptively find local settings. In essence, our approach identifies local regions with low estimated accuracy, constructs a set of candidate realignments of each region using a carefully chosen collection of parameter settings, and replaces the region if a realignment has higher estimated accuracy. This new method of local parameter advising, when combined with prior methods for global advising, boosts alignment accuracy by as much as 26% over the best default setting on hard-to-align protein benchmarks, and by 6.4% over global advising alone. Adaptive local realignment, implemented within the Opal aligner using the Facet accuracy estimator, is available at facet.cs.arizona.edu.
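
The three steps named in the abstract (find low-accuracy regions, realign each under several parameter settings, keep a realignment only if its estimated accuracy is higher) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_estimator` and `toy_aligner` are hypothetical stand-ins and not the Facet or Opal interfaces, and the sliding-window scheme, `window`, and `threshold` are illustrative choices rather than the method as implemented.

```python
from typing import Callable, List, Sequence, Tuple

Alignment = List[str]  # rows of equal length; '-' marks a gap


def toy_estimator(aln: Alignment) -> float:
    """Hypothetical accuracy estimator (NOT Facet): the fraction of
    columns that are gap-free and contain a single residue type."""
    if not aln or not aln[0]:
        return 0.0
    cols = list(zip(*aln))
    good = sum(1 for c in cols if "-" not in c and len(set(c)) == 1)
    return good / len(cols)


def toy_aligner(seqs: Sequence[str], setting: str) -> Alignment:
    """Hypothetical aligner (NOT Opal): each 'parameter setting' just
    pads the ungapped sequences on the right ('left') or on the left."""
    width = max(len(s) for s in seqs)
    if setting == "left":
        return [s.ljust(width, "-") for s in seqs]
    return [s.rjust(width, "-") for s in seqs]


def low_accuracy_regions(
    aln: Alignment,
    estimate: Callable[[Alignment], float],
    window: int,
    threshold: float,
) -> List[Tuple[int, int]]:
    """Slide a window over the columns and merge every window whose
    estimated accuracy is below the threshold into [start, end) regions."""
    ncols = len(aln[0])
    regions: List[Tuple[int, int]] = []
    for start in range(0, max(ncols - window, 0) + 1):
        end = min(start + window, ncols)
        if estimate([row[start:end] for row in aln]) < threshold:
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions


def adaptive_local_realign(
    aln: Alignment,
    estimate: Callable[[Alignment], float],
    align: Callable[[Sequence[str], str], Alignment],
    settings: Sequence[str],
    window: int = 4,
    threshold: float = 0.6,
) -> Alignment:
    """Replace each low-accuracy region by its best-scoring realignment,
    but only when that realignment beats the region's current estimate."""
    result = list(aln)
    # Work right to left so earlier column indices stay valid after a
    # region is replaced by a realignment of a different width.
    for start, end in reversed(low_accuracy_regions(aln, estimate, window, threshold)):
        current = [row[start:end] for row in result]
        best, best_score = current, estimate(current)
        seqs = [r.replace("-", "") for r in current]
        for setting in settings:
            candidate = align(seqs, setting)
            score = estimate(candidate)
            if score > best_score:
                best, best_score = candidate, score
        result = [row[:start] + mid + row[end:] for row, mid in zip(result, best)]
    return result


example = ["MKV-LAAGG", "MKVL-AAGG"]
realigned = adaptive_local_realign(example, toy_estimator, toy_aligner, ["left", "right"])
print(realigned)  # prints ['MKVLAAGG', 'MKVLAAGG']
```

In this toy example the misplaced gaps fall inside the single flagged region (columns 1 to 6), and the left-padded candidate raises the region's estimated accuracy from 4/6 to 1, so it replaces the original region. A region whose candidates all score no higher than the current estimate is left untouched.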
