Prediction of protein domain boundaries from sequence alone

We present a simple approach for identifying domain boundaries in proteins of unknown three-dimensional structure. The method rests on the hypothesis that high side-chain entropy in a region of a protein chain must be compensated by high residue interaction energy within that region, which could correlate with a well-structured part of the globule, that is, with a domain unit. For protein domains, this means that the domain boundary is marked by amino acid residues with low side-chain entropy, a quantity that correlates with side-chain size. Two residue classes contribute: a relatively high Ala and Gly content at the boundary gives the backbone between the domains high conformational entropy, and Pro residues form hinges for the relative orientation of the domains. Applied to 646 proteins with two contiguous domains extracted from the SCOP database, the method achieved a success rate of 63%. We also report domain boundary predictions for CASP5 targets obtained with the same method.
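The idea of locating a boundary at a stretch of low side-chain entropy can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives neither the entropy scale nor the window size, so the sketch assumes a crude proxy (the number of side-chain chi dihedral angles per residue, with Pro set to 0), a hypothetical window of 5 residues, and a hypothetical minimum domain length of 10. The names `CHI_COUNT`, `smoothed_profile`, and `predict_boundary` are illustrative only.

```python
# ASSUMPTION: number of side-chain chi dihedral angles, used here as a rough
# stand-in for side-chain entropy. This is NOT the scale used in the paper.
# Pro is set to 0 because its ring restricts side-chain freedom.
CHI_COUNT = {
    "G": 0, "A": 0, "P": 0,
    "S": 1, "C": 1, "T": 1, "V": 1,
    "I": 2, "L": 2, "D": 2, "N": 2, "H": 2, "F": 2, "Y": 2, "W": 2,
    "M": 3, "E": 3, "Q": 3,
    "K": 4, "R": 4,
}


def smoothed_profile(sequence, window=5):
    """Average the per-residue proxy over a sliding window centred on each residue.

    Windows are truncated at the chain termini. Non-standard residue codes
    raise KeyError.
    """
    half = window // 2
    values = [CHI_COUNT[aa] for aa in sequence]
    profile = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        profile.append(sum(values[lo:hi]) / (hi - lo))
    return profile


def predict_boundary(sequence, window=5, min_domain=10):
    """Return the 0-based index of the lowest smoothed value.

    Positions closer than min_domain residues to either terminus are skipped
    (a hypothetical constraint, so that both predicted domains have a
    plausible size). Ties resolve to the first minimum.
    """
    profile = smoothed_profile(sequence, window)
    candidates = range(min_domain, len(sequence) - min_domain)
    if not candidates:
        raise ValueError("sequence too short for the chosen min_domain")
    return min(candidates, key=lambda i: profile[i])


# Toy two-domain sequence (invented for illustration): two segments rich in
# long side chains joined by a Gly/Ala/Pro linker at indices 15-20.
toy = "KELRKMEQLRKEIKL" + "GAPGAG" + "RKEMLQKEIRKLEQK"
boundary = predict_boundary(toy)
print("predicted boundary index:", boundary)
```

In this toy case the minimum of the smoothed profile falls inside the Gly/Ala/Pro linker. A real application would substitute a published side-chain entropy scale and tune the window against known domain assignments such as those in SCOP.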
