Learning parameter sets for alignment advising

While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for the alignment scoring function (such as the choice of gap penalties and substitution scores), most users rely on the single default parameter setting provided by the aligner. A different parameter setting, however, might yield a much higher-quality alignment for the specific set of input sequences. The problem of picking a good choice of parameter values for specific input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that provides an estimate of the accuracy of the alignment computed by the aligner using a parameter choice. The parameter advisor picks the parameter choice from the set whose resulting alignment has highest estimated accuracy. We consider for the first time the problem of learning the optimal set of parameter choices for a parameter advisor that uses a given accuracy estimator. The optimal set is one that maximizes the expected true accuracy of the resulting parameter advisor, averaged over a collection of training data. While we prove that learning an optimal set for an advisor is NP-complete, we show there is a natural approximation algorithm for this problem, and prove a tight bound on its approximation ratio. Experiments with an implementation of this approximation algorithm on biological benchmarks, using various accuracy estimators from the literature, show it finds sets for advisors that are surprisingly close to optimal. Furthermore, the resulting parameter advisors are significantly more accurate in practice than simply aligning with a single default parameter choice.

[1]  Tal Pupko,et al.  An alignment confidence score capturing robustness to guide tree uncertainty. , 2010, Molecular biology and evolution.

[2]  Dan Graur,et al.  Heads or tails: a simple reliability check for multiple sequence alignments. , 2007, Molecular biology and evolution.

[3]  J. D. Thompson,et al.  Towards a reliable objective function for multiple sequence alignments. , 2001, Journal of molecular biology.

[4]  John D. Kececioglu,et al.  Estimating the Accuracy of Multiple Alignments and its Use in Parameter Advising , 2012, RECOMB.

[5]  R. Spang,et al.  Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. , 2002, Molecular biology and evolution.

[6]  Tero Aittokallio,et al.  Model-based prediction of sequence alignment quality , 2008, Bioinform..

[7]  Liisa Holm,et al.  COFFEE: an objective function for multiple sequence alignments , 1998, Bioinform..

[8]  Erik L. L. Sonnhammer,et al.  Automatic assessment of alignment quality , 2005, Nucleic acids research.

[9]  Paolo Di Tommaso,et al.  TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. , 2014, Molecular biology and evolution.

[10]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[11]  Olivier Poch,et al.  BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations , 2001, Nucleic Acids Res..

[12]  Lode Wyns,et al.  SABmark- a benchmark for sequence alignment that covers the entire known fold space , 2005, Bioinform..

[13]  S. Balaji,et al.  PALI - a database of Phylogeny and ALIgnment of homologous protein structures , 2001, Nucleic Acids Res..

[14]  Jimin Pei,et al.  AL2CO: calculation of positional conservation in a protein sequence alignment , 2001, Bioinform..

[15]  S. Balaji,et al.  PALI: a database of alignments and phylogeny of homologous protein structures , 2001, Bioinform..

[16]  Klaudia Frankfurter Computers And Intractability A Guide To The Theory Of Np Completeness , 2016 .

[17]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[18]  John D. Kececioglu,et al.  Accuracy Estimation and Parameter Advising for Protein Multiple Sequence Alignment , 2013, J. Comput. Biol..

[19]  John D. Kececioglu,et al.  Multiple alignment by aligning alignments , 2007, ISMB/ECCB.

[20]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[21]  Jian Ma,et al.  PSAR: measuring multiple sequence alignment reliability by probabilistic sampling , 2011, Nucleic acids research.