SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures.

MOTIVATION Multiple sequence alignment is an essential part of bioinformatics tools for a genome-scale study of genes and their evolution relations. However, making an accurate alignment between remote homologs is challenging. Here, we develop a method, called SPEM, that aligns multiple sequences using pre-processed sequence profiles and predicted secondary structures for pairwise alignment, consistency-based scoring for refinement of the pairwise alignment and a progressive algorithm for final multiple alignment. RESULTS The alignment accuracy of SPEM is compared with those of established methods such as ClustalW, T-Coffee, MUSCLE, ProbCons and PRALINE(PSI) in easy (homologs) and hard (remote homologs) benchmarks. Results indicate that the average sum of pairwise alignment scores given by SPEM are 7-15% higher than those of the methods compared in aligning remote homologs (sequence identity <30%). Its accuracy for aligning homologs (sequence identity >30%) is statistically indistinguishable from those of the state-of-the-art techniques such as ProbCons or MUSCLE 6.0. AVAILABILITY The SPEM server and its executables are available on http://theory.med.buffalo.edu.

[1]  Roland L Dunbrack,et al.  Scoring profile‐to‐profile sequence alignments , 2004, Protein science : a publication of the Protein Society.

[2]  T L Blundell,et al.  FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. , 2001, Journal of molecular biology.

[3]  Kimmen Sjölander,et al.  SATCHMO: Sequence Alignment and Tree Construction Using Hidden Markov Models , 2003, Bioinform..

[4]  William H. Press,et al.  Numerical Recipes: The Art of Scientific Computing , 1987 .

[5]  Liisa Holm,et al.  COFFEE: an objective function for multiple sequence alignments , 1998, Bioinform..

[6]  Hongyi Zhou,et al.  Single‐body residue‐level knowledge‐based energy score combined with sequence‐profile and secondary structure information for fold recognition , 2004, Proteins.

[7]  Yaoqi Zhou,et al.  SPARKS 2 and SP3 servers in CASP6 , 2005, Proteins.

[8]  D. Higgins,et al.  SAGA: sequence alignment by genetic algorithm. , 1996, Nucleic acids research.

[9]  Jimin Pei,et al.  PCMA: fast and accurate multiple sequence alignment based on profile consistency , 2003, Bioinform..

[10]  M J Sippl,et al.  Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. , 2000, Journal of molecular biology.

[11]  Yi Wang,et al.  An adaptive and iterative algorithm for refining multiple sequence alignment , 2004, Comput. Biol. Chem..

[12]  Jaap Heringa,et al.  Two Strategies for Sequence Comparison: Profile-preprocessed and Secondary Structure-induced Multiple Alignment , 1999, Comput. Chem..

[13]  J. Hein,et al.  Combining many multiple alignments in one improved alignment , 1999, Bioinform..

[14]  P. Hogeweg,et al.  The alignment of sets of sequences and the construction of phyletic trees: An integrated method , 2005, Journal of Molecular Evolution.

[15]  M. O. Dayhoff,et al.  22 A Model of Evolutionary Change in Proteins , 1978 .

[16]  C. Sander,et al.  Correlated mutations and residue contacts in proteins , 1994, Proteins.

[17]  Burkhard Rost,et al.  PHD - an automatic mail server for protein secondary structure prediction , 1994, Comput. Appl. Biosci..

[18]  Cédric Notredame,et al.  3DCoffee: combining protein sequences and structures within multiple sequence alignments. , 2004, Journal of molecular biology.

[19]  Desmond G. Higgins,et al.  Evaluation of iterative alignment algorithms for multiple alignment , 2005, Bioinform..

[20]  A. Godzik,et al.  Comparison of sequence profiles. Strategies for structural predictions using sequence information , 2008, Protein science : a publication of the Protein Society.

[21]  W. Press,et al.  Numerical Recipes: The Art of Scientific Computing , 1987 .

[22]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[23]  O. Gotoh Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. , 1996, Journal of molecular biology.

[24]  J Skolnick,et al.  Defrosting the frozen approximation: PROSPECTOR— A new approach to threading , 2001, Proteins.

[25]  A. Dress,et al.  Multiple DNA and protein sequence alignment based on segment-to-segment comparison. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[26]  J. Heringa,et al.  Homology-extended sequence alignment , 2005, Nucleic acids research.

[27]  Chuong B. Do,et al.  ProbCons: Probabilistic consistency-based multiple sequence alignment. , 2005, Genome research.

[28]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[29]  W. Press,et al.  Numerical Recipes in C++: The Art of Scientific Computing (2nd edn)1 Numerical Recipes Example Book (C++) (2nd edn)2 Numerical Recipes Multi-Language Code CD ROM with LINUX or UNIX Single-Screen License Revised Version3 , 2003 .

[30]  J. Devereux,et al.  A comprehensive set of sequence analysis programs for the VAX , 1984, Nucleic Acids Res..

[31]  W R Taylor,et al.  Protein structure alignment. , 1989, Journal of molecular biology.

[32]  T. Blundell,et al.  Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. , 1990, Journal of molecular biology.

[33]  S. Wodak,et al.  Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. , 1995, Protein engineering.

[34]  Lode Wyns,et al.  SABmark- a benchmark for sequence alignment that covers the entire known fold space , 2005, Bioinform..

[35]  K. Katoh,et al.  MAFFT version 5: improvement in accuracy of multiple sequence alignment , 2005, Nucleic acids research.

[36]  R B Russell,et al.  Fold recognition from sequence comparisons , 2001, Proteins.

[37]  John P. Overington,et al.  HOMSTRAD: A database of protein structure alignments for homologous families , 1998, Protein science : a publication of the Protein Society.

[38]  Y Xu,et al.  Protein threading using PROSPECT: Design and evaluation , 2000, Proteins.

[39]  William H. Press,et al.  The Art of Scientific Computing Second Edition , 1998 .

[40]  Rachel Kolodny,et al.  Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. , 2005, Journal of molecular biology.

[41]  S. Altschul,et al.  A tool for multiple sequence alignment. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[42]  Terri K. Attwood,et al.  The PRINTS Database: A Resource for Identification of Protein Families , 2002, Briefings Bioinform..

[43]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[44]  William R. Taylor,et al.  Iterated Sequence Databank Search Methods , 1999, Comput. Chem..

[45]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[46]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[47]  C. Notredame,et al.  Recent progress in multiple sequence alignment: a survey. , 2002, Pharmacogenomics.

[48]  P E Bourne,et al.  Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. , 1998, Protein engineering.

[49]  W. Kabsch A discussion of the solution for the best rotation to relate two sets of vectors , 1978 .

[50]  C. Sander,et al.  The FSSP database of structurally aligned protein fold families. , 1994, Nucleic acids research.

[51]  D. Fischer,et al.  Protein fold recognition using sequence‐derived predictions , 1996, Protein science : a publication of the Protein Society.

[52]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[53]  Jens Stoye,et al.  DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment , 1997, Comput. Appl. Biosci..

[54]  G. Barton,et al.  Multiple protein sequence alignment from tertiary structure comparison: Assignment of global and residue confidence levels , 1992, Proteins.

[55]  C. Sander,et al.  Correlated Mutations and Residue Contacts , 1994 .

[56]  S Karlin,et al.  A symmetric-iterated multiple alignment of protein sequences. , 1998, Journal of molecular biology.

[57]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[58]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[59]  Hongyi Zhou,et al.  Fold recognition by combining sequence profiles derived from evolution and from depth‐dependent structural alignment of fragments , 2004, Proteins.

[60]  Golan Yona,et al.  Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. , 2002, Journal of molecular biology.

[61]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[62]  Lawrence J. Berliner,et al.  Spin label investigations of α-chymotrypsin active site structure in single crystals☆ , 1979 .

[63]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..