Completeness in structural genomics

Structural genomics aims to obtain useful three-dimensional models of all proteins by combining experimental structure determination with comparative model building. We evaluate different target-selection strategies for optimizing the information returned per unit of effort. The strategy that maximizes structural coverage requires roughly one-seventh as many structure determinations as a strategy in which targets are selected at random. Assuming a reasonable threshold for model quality and a goal of 90% coverage, we extrapolate an estimate of the total effort that structural genomics will require. It would take ∼16,000 carefully selected structure determinations to construct useful atomic models for the vast majority of all proteins. In practice, unless target selection is coordinated globally, the total effort will likely increase by a factor of three. The task can be accomplished within a decade, provided that target selection is highly coordinated and significant funding is available.
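The comparison between coverage-maximizing and random target selection can be illustrated with a toy calculation. The sketch below is not the procedure used in the study: it assumes that proteins fall into disjoint families, that one solved structure makes every member of its family modelable, and that "maximizing coverage" means solving the largest families first. The family sizes are invented for illustration, and the names `determinations_needed` and `sizes` are hypothetical. The final lines apply the factor-of-three figure quoted above to the ∼16,000 estimate.

```python
import random


def determinations_needed(family_sizes, goal_percent=90):
    """Count structures solved, in the given order, until the goal is reached.

    Assumption: each solved structure covers its whole family, and families
    do not overlap. `family_sizes` is a list of family sizes in solving order.
    """
    total = sum(family_sizes)
    covered = 0
    for count, size in enumerate(family_sizes, start=1):
        covered += size
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= goal_percent * total:
            return count
    return len(family_sizes)


# Hypothetical, skewed family sizes: a few large families, many singletons.
sizes = [400, 200, 100, 50, 25] + [5] * 20 + [1] * 125  # 1000 proteins

# Coverage-maximizing order (assumed here to be largest family first).
greedy = determinations_needed(sorted(sizes, reverse=True))

# Random order of families, with a fixed seed for repeatability.
shuffled = sizes[:]
random.Random(0).shuffle(shuffled)
randomized = determinations_needed(shuffled)

# Figures quoted in the text: ~16,000 coordinated determinations,
# and about three times that effort without global coordination.
coordinated = 16_000
uncoordinated = 3 * coordinated

print(greedy, randomized, uncoordinated)
```

With disjoint families, largest-first ordering never needs more determinations than any other ordering, so `greedy` is a lower bound on `randomized` in this toy model. The size of the gap depends entirely on how skewed the invented family sizes are, so it should not be read as reproducing the seven-fold figure.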
