Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Sequence alignment is a common method for finding structurally conserved or similar regions in proteins. However, it is often inaccurate when the sequence identity between the sequences to be aligned is below 30%. In such sequences, different residues may play similar structural roles, and a substitution matrix defined over all 20 residue types can align them incorrectly. Residues can be clustered into a few groups according to the similarity of their physicochemical features. Such simplified alphabets reduce the complexity of protein sequences while retaining the key information encoded in them. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, using a database of aligned protein structures (DAPS), we propose a new clustering method for grouping residues based on substitution scores, and we construct substitution matrices at different levels of simplification. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are then applied to the recognition of structurally conserved or similar regions by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with an optimal reduced alphabet of about N = 9 residue groups.
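The idea of a reduced alphabet and its relative entropy can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the usual log-odds definition of a substitution score, s_ij = log2(q_ij / (p_i p_j)), and the relative entropy H = sum_ij q_ij s_ij of the resulting matrix. The four-letter alphabet, the pair frequencies in `toy_q`, and all function names are hypothetical and chosen only for illustration.

```python
import math

def reduce_sequence(seq, groups):
    """Rewrite a sequence using the first residue of each group as its label."""
    lookup = {aa: g[0] for g in groups for aa in g}
    return "".join(lookup[aa] for aa in seq)

def merge_pair_freqs(q, groups):
    """Sum aligned-pair frequencies q[(a, b)] over the residue groups."""
    lookup = {aa: g[0] for g in groups for aa in g}
    merged = {}
    for (a, b), freq in q.items():
        key = (lookup[a], lookup[b])
        merged[key] = merged.get(key, 0.0) + freq
    return merged

def log_odds(q):
    """Log-odds scores (in bits) from pair frequencies; zero-frequency pairs are skipped."""
    letters = sorted({a for a, _ in q})
    # Background frequency of each letter is the marginal of the pair table.
    p = {a: sum(q.get((a, b), 0.0) for b in letters) for a in letters}
    return {k: math.log2(f / (p[k[0]] * p[k[1]])) for k, f in q.items() if f > 0}

def relative_entropy(q):
    """Average score per aligned pair, H = sum_ij q_ij * s_ij."""
    s = log_odds(q)
    return sum(q[k] * s[k] for k in s)

# Hypothetical symmetric pair frequencies for a toy alphabet {A, V, D, E};
# these numbers are invented and are NOT taken from DAPS.
toy_q = {}
for a in "AVDE":
    toy_q[(a, a)] = 0.15
for a, b, f in [("A", "V", 0.08), ("D", "E", 0.08), ("A", "D", 0.01),
                ("A", "E", 0.01), ("V", "D", 0.01), ("V", "E", 0.01)]:
    toy_q[(a, b)] = f
    toy_q[(b, a)] = f

groups = ["AV", "DE"]            # two-letter reduced alphabet (assumed grouping)
reduced_q = merge_pair_freqs(toy_q, groups)

h_full = relative_entropy(toy_q)
h_reduced = relative_entropy(reduced_q)
print(reduce_sequence("AVDEAD", groups))
print(f"H(full) = {h_full:.3f} bits, H(reduced) = {h_reduced:.3f} bits")
```

Merging residues can never increase the relative entropy, so a grouping that keeps H close to the full-alphabet value loses little of the information carried by the substitution scores. Comparing H across candidate groupings is one plausible way to judge a reduced alphabet, under the assumptions stated above.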
