A study on protein sequence alignment quality

The alignment of two protein or DNA sequences is one of the most central methods in bioinformatics. However, large‐scale benchmarks examining the quality of these alignments remain scarce. In contrast, several recent large‐scale studies of the capacity of different methods to identify related sequences have led to new insights about the performance of fold recognition methods. To increase our understanding of fold recognition methods, we present a large‐scale benchmark of alignment quality. We compare alignments from several different alignment methods, including sequence alignments, hidden Markov models, PSI‐BLAST, CLUSTALW, and threading methods. For most methods, the alignment quality increases significantly at about 20% sequence identity. The difference in alignment quality between methods is quite small; the main difference lies in the exact position of the sharp rise in alignment quality, which falls around 15–20% sequence identity. The alignments are improved by using structural information. In general, the best alignments are obtained by methods that use predicted secondary structure information and sequence profiles obtained from PSI‐BLAST. One interesting observation is that the best alignment is produced by different methods for different pairs. This finding implies that a method able to select the best alignment method for each pair would yield a significant improvement in alignment quality. Proteins 2002;46:330–339. © 2002 Wiley‐Liss, Inc.
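To make the two quantities in the abstract concrete, the following is a minimal sketch of how sequence identity and a simple alignment-quality score might be computed from gapped alignment strings. It is an illustration only and not the measure used in the study: the function names are hypothetical, identity is assumed to be counted over columns where both sequences have a residue (other definitions exist), and quality is assumed to be the fraction of residue pairs in a reference alignment that the test alignment reproduces.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both sequences have a residue.

    Assumption: gap-containing columns are excluded from the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def aligned_pairs(aligned_a: str, aligned_b: str, gap: str = "-") -> set:
    """Return the set of (i, j) residue index pairs placed in the same column."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for a, b in zip(aligned_a, aligned_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def fraction_correct(test_a: str, test_b: str,
                     ref_a: str, ref_b: str) -> float:
    """Fraction of reference-aligned residue pairs also present in the test alignment.

    Assumption: the reference alignment (for example, one derived from
    structure superposition) is treated as the correct answer.
    """
    ref = aligned_pairs(ref_a, ref_b)
    if not ref:
        return 0.0
    test = aligned_pairs(test_a, test_b)
    return len(ref & test) / len(ref)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(percent_identity("ACDEF", "ACE-F"))
    print(fraction_correct("ACDEF", "ACE-F", "ACDEF", "AC-EF"))
```

Plotting such a quality score against sequence identity for many protein pairs would give the kind of curve the abstract describes, although a structure-based quality measure would be needed to reproduce the benchmark itself.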
