Probalign: multiple sequence alignment using partition function posterior probabilities

MOTIVATION The maximum expected accuracy optimization criterion for multiple sequence alignment uses pairwise posterior probabilities of residues to align sequences. The partition function methodology is one way of estimating these probabilities. Here, we combine these two ideas for the first time to construct maximal expected accuracy sequence alignments. RESULTS We bridge the two techniques within the program Probalign. Our results indicate that Probalign alignments are generally more accurate than other leading multiple sequence alignment methods (i.e. Probcons, MAFFT and MUSCLE) on the BAliBASE 3.0 protein alignment benchmark. Similarly, Probalign also outperforms these methods on the HOMSTRAD and OXBENCH benchmarks. Probalign ranks statistically highest (P-value < 0.005) on all three benchmarks. Deeper scrutiny of the technique indicates that the improvements are largest on datasets containing N/C-terminal extensions and on datasets containing long and heterogeneous length proteins. These points are demonstrated on both real and simulated data. Finally, our method also produces accurate alignments on long and heterogeneous length datasets containing protein repeats. Here, alignment accuracy scores are at least 10% and 15% higher than the other three methods when standard deviation of length is >300 and 400, respectively. AVAILABILITY Open source code implementing Probalign as well as for producing the simulated data, and all real and simulated data are freely available from http://www.cs.njit.edu/usman/probalign

[1]  S. Karlin,et al.  Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[2]  S. Miyazawa A reliable sequence alignment method based on probabilities of residue correspondences. , 1995, Protein engineering.

[3]  Serafim Batzoglou,et al.  CONTRAlign: Discriminative Training for Protein Sequence Alignment , 2006, RECOMB.

[4]  Folker Meyer,et al.  Rose: generating sequence families , 1998, Bioinform..

[5]  Brian T. Sutch,et al.  Predicting protein functional sites with phylogenetic motifs , 2004, Proteins.

[6]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[7]  M. O. Dayhoff,et al.  Atlas of protein sequence and structure , 1965 .

[8]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[9]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[10]  D. I. Hawkins,et al.  100 Statistical Tests , 1994 .

[11]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[12]  S. Altschul A protein alignment scoring system sensitive at all evolutionary distances , 1993, Journal of Molecular Evolution.

[13]  Olivier Poch,et al.  BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs , 1999, Bioinform..

[14]  M. O. Dayhoff A model of evolutionary change in protein , 1978 .

[15]  Lior Pachter,et al.  Alignment Metric Accuracy , 2005, q-bio/0510052.

[16]  Olivier Poch,et al.  BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark , 2005, Proteins.

[17]  John P. Overington,et al.  HOMSTRAD: A database of protein structure alignments for homologous families , 1998, Protein science : a publication of the Protein Society.

[18]  G. Gonnet,et al.  Exhaustive matching of the entire protein sequence database. , 1992, Science.

[19]  Olivier Poch,et al.  BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations , 2001, Nucleic Acids Res..

[20]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[21]  K. Katoh,et al.  MAFFT version 5: improvement in accuracy of multiple sequence alignment , 2005, Nucleic acids research.

[22]  Gopal Kanji,et al.  100 Statistical Tests , 1994 .

[23]  Peter F. Stadler,et al.  Stochastic pairwise alignments , 2002, ECCB.

[24]  Michael Kaufmann,et al.  BMC Bioinformatics BioMed Central , 2005 .

[25]  Chuong B. Do,et al.  ProbCons: Probabilistic consistency-based multiple sequence alignment. , 2005, Genome research.

[26]  Gajendra P. S. Raghava,et al.  OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy , 2003, BMC Bioinformatics.

[27]  C. Notredame,et al.  Recent progress in multiple sequence alignment: a survey. , 2002, Pharmacogenomics.