Learning Scoring Schemes for Sequence Alignment from Partial Examples

When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25%.
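As a rough illustration of the quantity being minimized, the following sketch computes the average gap between the cost of each example alignment and the cost of an optimal alignment of the same sequences, for a given choice of parameters. It rests on simplifying assumptions that are not taken from the paper: a cost-minimizing scoring model with only two parameters (a hypothetical `mismatch` cost and a linear per-column `gap` cost), complete rather than partial examples, and absolute rather than relative error. All function names are illustrative.

```python
def example_cost(row_a, row_b, mismatch, gap):
    """Cost of a given alignment, written as two equal-length gapped rows."""
    assert len(row_a) == len(row_b)
    cost = 0.0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            cost += gap          # column containing a gap character
        elif x != y:
            cost += mismatch     # substitution column
    return cost


def optimal_cost(a, b, mismatch, gap):
    """Minimum alignment cost of strings a and b (row-by-row dynamic program)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def average_error(examples, mismatch, gap):
    """Mean of (example cost - optimal cost) over all example alignments.

    Each term is nonnegative, since no alignment costs less than an optimal one.
    """
    total = 0.0
    for row_a, row_b in examples:
        a = row_a.replace("-", "")   # recover the ungapped sequences
        b = row_b.replace("-", "")
        total += example_cost(row_a, row_b, mismatch, gap) - optimal_cost(a, b, mismatch, gap)
    return total / len(examples)


examples = [("AC-GT", "ACCGT"), ("ACGT", "AGCT")]
err_high_gap = average_error(examples, mismatch=1.0, gap=2.0)
err_low_gap = average_error(examples, mismatch=1.0, gap=0.5)
```

Under these assumptions, an inverse alignment procedure would search for the parameter values that make `average_error` as small as possible. Here the higher gap cost leaves both examples optimal, while the lower gap cost makes the second example suboptimal because a cheaper alignment with gaps exists.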
