Structure-dependent sequence alignment for remotely related proteins

MOTIVATION The quality of a model structure derived from a comparative modeling procedure is dictated by the accuracy of the predicted sequence-template alignment. As the sequence-template pairs are increasingly remote in sequence relationship, the prediction of the sequence-template alignments becomes increasingly problematic with sequence alignment methods. Structural information of the template, used in connection with the sequence relationship of the sequence-template pair, could significantly improve the accuracy of the sequence-template alignment. In this paper, we describe a sequence-template alignment method that integrates sequence and structural information to enhance the accuracy of sequence-template alignments for distantly related protein pairs. RESULTS The structure-dependent sequence alignment (SDSA) procedure was optimized for coverage and accuracy on a training set of 412 protein pairs; the structures for each of the training pairs are similar (RMSD< approximately 4A) but the sequence relationship is undetectable (average pair-wise sequence identity = 8%). The optimized SDSA procedure was then applied to extend PSI-BLAST local alignments by calculating the global alignments under the constraint of the residue pairs in the local alignments. This composite alignment procedure was assessed with a testing set of 1421 protein pairs, of which the pair-wise structures are similar (RMSD< approximately 4A) but the sequences are marginally related at best in each pair (average pair-wise sequence identity = 13%). The assessment showed that the composite alignment procedure predicted more aligned residues pairs with an average of 27% increase in correctly aligned residues over the standard PSI-BLAST alignments for the protein pairs in the testing set.

[1]  A. Sali,et al.  Comparative protein structure modeling of genes and genomes. , 2000, Annual review of biophysics and biomolecular structure.

[2]  G J Barton,et al.  Evaluation and improvements in the automatic alignment of protein sequences. , 1987, Protein engineering.

[3]  M J Sippl,et al.  Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. , 2000, Journal of molecular biology.

[4]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[5]  B Honig,et al.  An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance. , 2000, Journal of molecular biology.

[6]  J. M. Sauder,et al.  Large‐scale comparison of protein sequence alignment algorithms with structure alignments , 2000, Proteins.

[7]  T. Smith,et al.  Alignment of protein sequences using secondary structure: a modified dynamic programming method. , 1990, Protein engineering.

[8]  S F Altschul,et al.  Local alignment statistics. , 1996, Methods in enzymology.

[9]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[10]  E. Lindahl,et al.  Identification of related proteins on family, superfamily and fold level. , 2000, Journal of molecular biology.

[11]  B. Honig,et al.  An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments. , 2000, Journal of molecular biology.

[12]  R L Jernigan,et al.  Identifying sequence-structure pairs undetected by sequence alignments. , 2000, Protein engineering.

[13]  M Levitt,et al.  Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. , 1986, Protein engineering.

[14]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[15]  B Honig,et al.  An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. , 2000, Journal of molecular biology.

[16]  U. Hobohm,et al.  Selection of representative protein data sets , 1992, Protein science : a publication of the Protein Society.

[17]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[18]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[19]  David C. Jones,et al.  GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. , 1999, Journal of molecular biology.

[20]  Alignment of protein sequences using the hydrophobic core scores. , 1989, Protein engineering.

[21]  Leszek Rychlewski,et al.  Improving the quality of twilight‐zone alignments , 2000, Protein science : a publication of the Protein Society.

[22]  C. Chothia,et al.  Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[23]  J. Wootton,et al.  Analysis of compositionally biased regions in sequence databases. , 1996, Methods in enzymology.