Non-globular Domains in Protein Sequences: Automated Segmentation Using Complexity Measures

Computational methods based on mathematically defined measures of compositional complexity have been developed to distinguish globular from non-globular regions of protein sequences. Compact globular structures in protein molecules are shown to be determined by amino acid sequences of high informational complexity. Sequences of known crystal structure in the Brookhaven Protein Data Bank differ only slightly from randomly shuffled sequences in the distribution of statistical properties such as local compositional complexity. In contrast, in the much larger body of deduced sequences in the SWISS-PROT database, approximately one quarter of the residues occur in segments of non-randomly low complexity, and approximately half of the entries contain at least one such segment. Sequences of proteins with known, physicochemically defined non-globular regions have been analyzed, including collagens, different classes of coiled-coil proteins, elastins, histones, non-histone proteins, mucins, proteoglycan core proteins and proteins containing long single solvent-exposed alpha-helices. The SEG algorithm provides an effective general method for partitioning these sequences into globular and non-globular regions fully automatically. This method is also facilitating the discovery of new classes of long, non-globular sequence segments, as illustrated by the human CAN gene product, which is involved in tumor induction.
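The notion of local compositional complexity can be illustrated with a minimal sketch. It is not the SEG algorithm itself. It only scores each fixed-length window by the Shannon entropy of its residue composition and merges windows that fall at or below a cutoff. The function names, the window length of 12 and the cutoff of 2.2 bits are assumptions chosen for illustration, not values stated in the abstract.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    # Sum of p * log2(1/p) over the residue types present in the window.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def low_complexity_segments(seq: str, window: int = 12, threshold: float = 2.2):
    """Return merged half-open (start, end) intervals covered by low-entropy windows.

    `window` and `threshold` are illustrative assumptions, not published parameters.
    """
    segments = []
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= threshold:
            start, end = i, i + window
            if segments and start <= segments[-1][1]:
                # Overlaps or touches the previous segment: extend it.
                segments[-1] = (segments[-1][0], max(segments[-1][1], end))
            else:
                segments.append((start, end))
    return segments


if __name__ == "__main__":
    flank = "ACDEFHIKLMNP"            # 12 distinct residues, high complexity
    seq = flank + "G" * 12 + flank    # a homopolymeric run between two diverse flanks
    print(low_complexity_segments(seq))
```

Because every window that overlaps the biased run strongly enough is merged, the reported interval extends somewhat beyond the run itself. A production method would need a further step to trim such raw segments, which this sketch does not attempt.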
