Simple and Fast Inverse Alignment

For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. While some choices of substitution scores are now common, largely by convention, there is no standard for choosing gap penalties. An objective way to resolve this question is to learn the appropriate values by solving the Inverse String Alignment Problem: given examples of correct alignments, find parameter values that make the examples optimal-scoring alignments of their strings. We present a new polynomial-time algorithm for Inverse String Alignment that is simple to implement, fast in practice, and for the first time can learn hundreds of parameters simultaneously. The approach is also flexible: minor modifications allow us to solve inverse unique alignment (find parameter values that make the examples the unique optimal alignments of their strings) and inverse near-optimal alignment (find parameter values that make the example alignments as close to optimal as possible). Computational results with an implementation for global alignment show that we can find best-possible values for all 212 parameters of the standard protein-sequence scoring model from hundreds of alignments in a few minutes of computation.
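To make the problem statement concrete, the sketch below checks the condition that Inverse String Alignment asks for: whether a given example alignment is optimal-scoring under a candidate set of parameter values. It is not the algorithm presented in the paper, which the abstract does not describe. It assumes a toy three-parameter model (`match`, `mismatch`, and a per-column `gap` score, all hypothetical names) with scores maximized, rather than the 212-parameter protein model. The difference it returns is zero exactly when the example is optimal, and is the kind of quantity the near-optimal variant would try to make small.

```python
def optimal_score(a, b, match, mismatch, gap):
    """Best global alignment score of strings a and b (maximization).

    Standard dynamic programming with a linear gap score; only the
    previous row of the table is kept.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # substitution column
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def alignment_score(row_a, row_b, match, mismatch, gap):
    """Score of an example alignment given as two equal-length rows with '-' for gaps."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap
        else:
            score += match if x == y else mismatch
    return score


def optimality_gap(row_a, row_b, match, mismatch, gap):
    """Optimal score of the underlying strings minus the example's score.

    Zero means the example is an optimal alignment under these parameters.
    """
    a = row_a.replace("-", "")
    b = row_b.replace("-", "")
    best = optimal_score(a, b, match, mismatch, gap)
    return best - alignment_score(row_a, row_b, match, mismatch, gap)


if __name__ == "__main__":
    # Hypothetical parameter values and example alignments, for illustration only.
    params = dict(match=1, mismatch=-1, gap=-2)
    print(optimality_gap("AC-GT", "ACTGT", **params))
    print(optimality_gap("ACGT-", "ACTGT", **params))
```

Under the assumed parameters, the first example alignment is optimal for its strings while the second is not. An inverse alignment method would search for parameter values that drive this difference to zero (or as close to zero as possible) across all example alignments at once.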
