ProbCons: Probabilistic consistency-based multiple sequence alignment.

To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem, both because of the high computational cost and because proper objective functions for measuring alignment quality are lacking. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present ProbCons, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark data sets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, ProbCons achieves statistically significant improvement over other leading methods while maintaining practical speed. ProbCons is publicly available as a Web resource.
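
The abstract names probabilistic consistency without spelling it out. As a rough illustration only, the sketch below assumes the commonly described form of the idea: the match probability for residues x_i and y_j is re-estimated by averaging, over every sequence z in the set, the evidence that x_i and y_j both align to the same residue of z. The function names, the dictionary layout of `posteriors`, and the toy numbers are hypothetical and are not taken from the ProbCons implementation.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def consistency_transform(posteriors, n_seqs, x, y):
    """Re-estimate match probabilities for the pair (x, y).

    posteriors[(a, b)] with a < b is assumed to hold a len(a) x len(b)
    matrix of pairwise match probabilities (hypothetical layout).
    """
    def get(a, b):
        # Stored once per unordered pair; transpose for the reverse order.
        return posteriors[(a, b)] if a < b else transpose(posteriors[(b, a)])

    p_xy = get(x, y)
    # z = x and z = y each contribute P_xy, because a sequence aligned
    # to itself is treated as the identity.
    total = [[2.0 * v for v in row] for row in p_xy]
    for z in range(n_seqs):
        if z in (x, y):
            continue
        # Evidence routed through the third sequence z.
        via_z = matmul(get(x, z), get(z, y))
        total = [[t + v for t, v in zip(tr, vr)] for tr, vr in zip(total, via_z)]
    return [[v / n_seqs for v in row] for row in total]


# Toy input: three sequences of length 2. The direct (0, 1) comparison is
# uncertain, but both align confidently to sequence 2.
toy = {
    (0, 1): [[0.5, 0.0], [0.0, 0.5]],
    (0, 2): [[1.0, 0.0], [0.0, 1.0]],
    (1, 2): [[1.0, 0.0], [0.0, 1.0]],
}
result = consistency_transform(toy, 3, 0, 1)
```

In this toy case the diagonal entries for the pair (0, 1) rise from 0.5 to about 0.67, because the third sequence supports the same residue pairing.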
