Domain size distributions can predict domain boundaries

MOTIVATION: The sizes of protein domains observed in the 3D-structure database follow a surprisingly narrow distribution. Moreover, in over 80% of instances a structural domain is formed from a single continuous segment of the chain. These observations imply that, on an otherwise uncharacterized sequence, some choices of domain boundaries are more likely than others, based solely on the size and segment number of the predicted domains. This property might be used to guess the locations of protein domain boundaries.

RESULTS: To test this possibility, we enumerate putative domain boundaries and calculate their relative likelihood under a probability model that considers only the size and segment number of predicted domains. In a cross-validated test using sequences of known 3D structure, we ask whether the most likely guesses agree with the observed domain structure. We find that domain boundary predictions are surprisingly successful for sequences up to 400 residues long, and that guessing domain boundaries in this way can improve the sensitivity of threading analysis.
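The enumerate-and-score idea can be illustrated with a minimal sketch. It is not the model used in the paper: the log-normal size density, its parameters (`MEDIAN_SIZE`, `LOG_SIGMA`), the minimum domain length `MIN_DOMAIN`, and the function names are all hypothetical stand-ins for the empirical size distribution. The sketch considers only continuous domains and at most one boundary, so segment number plays no role here, and the scores are unnormalized relative log-likelihoods.

```python
import math

# Hypothetical parameters, chosen for illustration only (not fitted to any database).
MEDIAN_SIZE = 150.0   # assumed median domain size, in residues
LOG_SIGMA = 0.5       # assumed spread of log(domain size)
MIN_DOMAIN = 40       # assumed smallest domain worth considering


def log_size_density(size):
    """Log of a log-normal density at `size`; a stand-in for the observed size distribution."""
    z = (math.log(size) - math.log(MEDIAN_SIZE)) / LOG_SIGMA
    return -0.5 * z * z - math.log(size * LOG_SIGMA * math.sqrt(2.0 * math.pi))


def rank_boundaries(seq_len, top=5):
    """Score the no-boundary case and every single boundary position.

    Returns up to `top` (log_likelihood, cut) pairs, best first.
    `cut` is None for the one-domain case, otherwise the length of the
    N-terminal domain.
    """
    # One domain spanning the whole sequence.
    candidates = [(log_size_density(seq_len), None)]
    # Two continuous domains: sizes `cut` and `seq_len - cut`, scored independently.
    for cut in range(MIN_DOMAIN, seq_len - MIN_DOMAIN + 1):
        score = log_size_density(cut) + log_size_density(seq_len - cut)
        candidates.append((score, cut))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top]


if __name__ == "__main__":
    for score, cut in rank_boundaries(300):
        label = "no boundary" if cut is None else f"boundary after residue {cut}"
        print(f"{label}: log-likelihood {score:.2f}")
```

Because each two-domain guess multiplies two size densities, a single specific boundary can score below the one-domain case even when two-domain arrangements are more probable in aggregate. A fuller model would also need a term for segment number in order to handle discontinuous domains.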
