Paper information - MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities

MOTIVATION: Multiple sequence alignment is of central importance to bioinformatics and computational biology. Although a large number of algorithms for computing a multiple sequence alignment have been designed, the efficient computation of highly accurate multiple alignments remains a challenge.

RESULTS: We present MSAProbs, a new and practical multiple alignment algorithm for protein sequences. MSAProbs calculates posterior probabilities by combining pair hidden Markov models with partition functions. Two further techniques, weighted probabilistic consistency transformation and weighted profile-profile alignment, are incorporated to improve alignment accuracy. Assessed on the popular benchmarks BAliBASE, PREFAB, SABmark and OXBENCH, MSAProbs achieves statistically significant accuracy improvements over the existing top-performing aligners, including ClustalW, MAFFT, MUSCLE, ProbCons and Probalign. MSAProbs is also optimized for multi-core CPUs through a multi-threaded design, which gives it a competitive execution time compared to other aligners.

AVAILABILITY: The source code of MSAProbs, written in C++, is freely and publicly available from http://msaprobs.sourceforge.net.
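The two probabilistic steps named in the abstract can be illustrated with a minimal Python sketch. It is not the MSAProbs C++ implementation, and the abstract does not state the formulas, so both are assumptions here: the pair-HMM and partition-function posteriors are combined by a root mean square, and the weighted consistency transformation re-estimates each pairwise posterior matrix through every third sequence, scaled by per-sequence weights. The function names `combine_posteriors` and `weighted_consistency` and the dictionary layout are hypothetical.

```python
import math


def combine_posteriors(p_hmm, p_pf):
    """Combine two posterior matrices cell by cell.

    Assumption: a root-mean-square combination of the pair-HMM and
    partition-function posteriors; the abstract does not give the rule.
    """
    return [
        [math.sqrt((a * a + b * b) / 2.0) for a, b in zip(row_hmm, row_pf)]
        for row_hmm, row_pf in zip(p_hmm, p_pf)
    ]


def _transpose(m):
    return [list(col) for col in zip(*m)]


def _mat_mul(a, b):
    """Dense matrix product for lists of lists."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def weighted_consistency(post, weights, x, y):
    """Re-estimate the posterior matrix of sequences x and y.

    post[(a, b)] with a < b is a len(a) x len(b) posterior matrix.
    weights maps each sequence index to its (assumed) sequence weight.
    Assumed form: ((w_x + w_y) * P_xy + sum_z w_z * P_xz * P_zy) / sum(w).
    """
    def get(a, b):
        return post[(a, b)] if a < b else _transpose(post[(b, a)])

    p_xy = get(x, y)
    total = [[(weights[x] + weights[y]) * v for v in row] for row in p_xy]
    for z in weights:
        if z == x or z == y:
            continue
        via_z = _mat_mul(get(x, z), get(z, y))  # evidence routed through z
        for i, row in enumerate(via_z):
            for j, v in enumerate(row):
                total[i][j] += weights[z] * v
    norm = sum(weights.values())
    return [[v / norm for v in row] for row in total]
```

With uniform weights, the second function reduces to an unweighted consistency transformation of the kind used by consistency-based aligners; non-uniform weights are intended to down-weight evidence from over-represented, closely related sequences.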
