Gaps in structurally similar proteins: Towards improvement of multiple sequence alignment

An algorithm was developed to locally optimize gaps from the FSSP database. Over 2 million gaps were identified from all versus all FSSP structure comparisons, and datasets of non‐identical gaps and flanking regions comprising between 90,000 and 135,000 sequence fragments were extracted for statistical analysis. Relative to background frequencies, gaps were enriched in residue types with small side chains and high turn propensity (D, G, N, P, S), and were depleted in residue types with hydrophobic side chains (C, F, I, L, V, W, Y). In contrast, regions flanking a gap exhibited opposite trends in amino acid frequencies, i.e., enrichment in hydrophobic residues and a high degree of secondary structure. Log‐odds scores of residue type as a function of position in or around a gap were derived from the statistics. Three simple experiments demonstrated that these scores contained significant predictive information. First, regions where gaps were observed in single sequences taken from HOMSTRAD structure‐based multiple sequence alignments generally scored higher than regions where gaps were not observed. Second, given the correct pairwise‐aligned cores, the actual positions of gaps could be reproduced from sequence more accurately using the structurally‐derived statistics than by using random pairwise alignments. Finally, revision of the Clustal‐W residue‐specific gap opening parameters with this new information improved the agreement of Clustal‐W alignments with the structure‐based alignments. At least three applications for these results are envisioned: improvement of gap penalties in pairwise (or multiple) sequence alignment, prediction of regions of single sequences likely (or unlikely) to contain indels, and more accurate placement of gaps in automated pairwise structure alignment. Proteins 2003;53:000–000. © 2003 Wiley‐Liss, Inc.

[1]  D. Eisenberg Proteins. Structures and molecular properties, T.E. Creighton. W. H. Freeman and Company, New York (1984), 515, $36.95 , 1985 .

[2]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[3]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[4]  P. Argos,et al.  Analysis of insertions/deletions in protein structures. , 1992, Journal of molecular biology.

[5]  C. Sander,et al.  Protein structure comparison by alignment of distance matrices. , 1993, Journal of molecular biology.

[6]  G. Gonnet,et al.  Empirical and structural models for insertions and deletions in the divergent evolution of proteins. , 1993, Journal of molecular biology.

[7]  M S Waterman,et al.  Sequence alignment and penalty choice. Review of concepts, case studies and implications. , 1994, Journal of molecular biology.

[8]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[9]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[10]  S. Henikoff,et al.  Position-based sequence weights. , 1994, Journal of molecular biology.

[11]  P. Argos,et al.  An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. , 1995, Journal of molecular biology.

[12]  W. Pearson Comparison of methods for searching protein sequence databases , 1995, Protein science : a publication of the Protein Society.

[13]  L. Råde,et al.  Mathematics handbook for science and engineering , 1995 .

[14]  W R Taylor,et al.  Multiple protein sequence alignment: algorithms and gap insertion. , 1996, Methods in enzymology.

[15]  J. Thompson,et al.  Using CLUSTAL for multiple sequence alignments. , 1996, Methods in enzymology.

[16]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[17]  Chris Sander,et al.  Dali/FSSP classification of three-dimensional protein folds , 1997, Nucleic Acids Res..

[18]  R. Aurora,et al.  Helix capping , 1998, Protein science : a publication of the Protein Society.

[19]  S F Altschul,et al.  Generalized affine gap costs for protein sequence alignment , 1998, Proteins.

[20]  John P. Overington,et al.  HOMSTRAD: A database of protein structure alignments for homologous families , 1998, Protein science : a publication of the Protein Society.

[21]  G. Mitchison A Probabilistic Treatment of Phylogeny and Sequence Alignment , 1999, Journal of Molecular Evolution.

[22]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[23]  S. Sunyaev,et al.  PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. , 1999, Protein engineering.

[24]  T L Blundell,et al.  FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. , 2001, Journal of molecular biology.

[25]  Jiye Shi,et al.  HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families , 2001, Bioinform..

[26]  N. Grishin Fold change in evolution of protein structures. , 2001, Journal of structural biology.

[27]  M J Sternberg,et al.  An approach to improving multiple alignments of protein sequences using predicted secondary structure. , 2001, Protein engineering.

[28]  B Qian,et al.  Distribution of indel lengths , 2001, Proteins.

[29]  William R. Pearson,et al.  Empirical determination of effective gap penalties for sequence comparison , 2002, Bioinform..

[30]  N. Grishin,et al.  COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. , 2003, Journal of molecular biology.

[31]  J. Felsenstein,et al.  Inching toward reality: An improved likelihood model of sequence evolution , 2004, Journal of Molecular Evolution.

[32]  Lvek,et al.  Evolution of protein structures and functions , 2022 .