Target selection for structural genomics based on combining fold recognition and crystallisation prediction methods: application to the human proteome

The objective of this study is to automatically identify regions of the human proteome that are suitable for 3D structure determination by X-ray crystallography and to annotate them according to their likelihood of producing diffraction-quality crystals. The results give structural genomics laboratories a way to select human proteins based on the statistical likelihood of crystallisation success. Combining fold recognition and crystallisation prediction algorithms makes it feasible to calculate crystallisability across the entire human proteome. The study estimates that there are approximately 40,000 crystallisable regions in the human proteome. Currently, only 15% of these regions (approx. 6,000 sequences) are covered by a solved structure with at least 95% sequence identity. The remaining unsolved regions have been categorised into 5 crystallisation classes and an integral membrane protein (IMP) class, based on established structure prediction, crystallisation prediction and transmembrane (TM) helix prediction algorithms. Approximately 750 unsolved regions (2% of all predicted regions) have been identified as having a PDB fold representative (template) and an ‘optimal’ likelihood of crystallisation. At the other end of the spectrum, more than 10,500 non-IMP regions with a PDB template (26%) are classified as ‘very difficult’ to crystallise, and almost 2,500 regions (6%) are predicted to contain at least 3 TM helices. The 3D-SPECS (3D Structural Proteomics Explorer with Crystallisation Scores) website contains crystallisation predictions for the entire human proteome and can be found at http://www.bioinformaticsplus.org/3dspecs.
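The categorisation described above can be sketched as a short decision procedure. This is only an illustrative sketch, not the 3D-SPECS implementation: the `Region` fields, the assumption that the crystallisation score lies in [0, 1], the score cut-offs, and the names of the three intermediate classes are all hypothetical. Only the 95% identity criterion, the 3 TM helix criterion, and the labels ‘optimal’ and ‘very difficult’ come from the text.

```python
from dataclasses import dataclass

# Hypothetical cut-offs on an assumed 0..1 crystallisation score.
# Only "optimal" and "very difficult" are named in the abstract; the
# intermediate labels and every threshold here are placeholders.
CLASS_BOUNDS = [
    (0.8, "optimal"),
    (0.6, "intermediate class 2 (placeholder)"),
    (0.4, "intermediate class 3 (placeholder)"),
    (0.2, "intermediate class 4 (placeholder)"),
]


@dataclass
class Region:
    name: str
    best_pdb_identity: float  # % sequence identity to the closest solved structure
    tm_helices: int           # number of predicted transmembrane helices
    cryst_score: float        # assumed: higher means more likely to crystallise


def classify(region: Region) -> str:
    """Assign a region to 'solved', 'IMP', or one of 5 crystallisation classes."""
    # Regions already covered by a structure at >= 95% identity count as solved.
    if region.best_pdb_identity >= 95.0:
        return "solved"
    # Unsolved regions with at least 3 predicted TM helices form the IMP class.
    if region.tm_helices >= 3:
        return "IMP"
    # Remaining regions are binned by crystallisation score, best class first.
    for lower_bound, label in CLASS_BOUNDS:
        if region.cryst_score >= lower_bound:
            return label
    return "very difficult"
```

For example, under these assumed cut-offs, `classify(Region("example", 30.0, 0, 0.9))` returns `"optimal"`, while the same region with 4 predicted TM helices would be assigned to `"IMP"` regardless of its score.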
