Growth of novel protein structural data

Contrary to popular assumption, the rate of growth of structural data has slowed, and the Protein Data Bank (PDB) has not been growing exponentially since 1995. Reaching such a dramatic conclusion requires careful measurement of the growth of novel structures, which can be achieved either by clustering entry sequences or by using a novel index that down-weights entries with a higher number of sequence neighbors. These measures agree, and growth rates are very similar for entire PDB files, clusters, and weighted chains. The overall sizes of the Structural Classification of Proteins (SCOP) categories (the numbers of families, superfamilies, and folds) appear to be directly proportional to the number of deposited PDB files. However, our weighted chain count, which is the measure most strongly correlated with the change in the size of each SCOP category in any time period, shows that the rate of increase of the SCOP categories is actually slowing down. This enables the final size of each of these SCOP categories to be predicted without examining or comparing protein structures. In the last 3 years, structures solved by structural genomics (SG) initiatives, especially the United States National Institutes of Health Protein Structure Initiative, have begun to redress the slowing growth of the PDB. Structures solved by SG are 3.8 times less sequence-redundant than typical PDB structures. Since mid-2004, SG programs have contributed half of the novel structures as measured by weighted chain counts. Our analysis does not rely on visual inspection of coordinate sets; it is performed automatically, providing an accurate, up-to-date measure of the growth of novel protein structural data.
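The abstract does not give the formula for the weighted chain count, so the following is only a minimal sketch of the general idea, under a stated assumption: each chain is weighted by 1 / (1 + number of sequence neighbors). The function names, the `neighbors` mapping, and the toy data are all hypothetical. The sketch also counts clusters as connected components of the neighbor graph, a simple stand-in for sequence clustering, to illustrate why the two measures of novelty can agree.

```python
def weighted_chain_count(neighbors):
    """Sum of per-chain weights.

    `neighbors` maps each chain id to the set of OTHER chains judged to be
    its sequence neighbors. ASSUMPTION: weight = 1 / (1 + neighbor count);
    the weighting used in the paper may differ.
    """
    return sum(1.0 / (1 + len(others)) for others in neighbors.values())


def cluster_count(neighbors):
    """Number of connected components in the neighbor graph.

    ASSUMPTION: single-linkage grouping stands in for sequence clustering.
    """
    seen = set()
    clusters = 0
    for start in neighbors:
        if start in seen:
            continue
        clusters += 1
        stack = [start]
        while stack:  # depth-first walk over one component
            chain = stack.pop()
            if chain in seen:
                continue
            seen.add(chain)
            stack.extend(neighbors.get(chain, ()))
    return clusters


# Hypothetical toy data: A, B, and C are mutual neighbors; D has none.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": set(),
}

print(weighted_chain_count(toy))
print(cluster_count(toy))
```

In this toy case the four chains represent two novel sequences under either measure: the three redundant chains share a total weight of one, and the singleton keeps a weight of one. When neighbor sets are not mutually consistent (for example, A neighbors B and B neighbors C, but A does not neighbor C), the two counts can diverge, which is one reason to compare them rather than rely on either alone.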
