Predicting protein crystallization propensity from protein sequence

The high-throughput structure determination pipelines developed by structural genomics programs offer a unique opportunity for data mining. One important question is how protein properties derived from the primary sequence correlate with a protein's propensity to yield X-ray quality crystals (crystallizability) and 3D X-ray structures. A set of protein properties was computed for over 1,300 proteins that expressed well but were insoluble, and for ~720 unique proteins that resulted in X-ray structures. The correlation of the protein's isoelectric point and grand average of hydropathy (GRAVY) with crystallizability was analyzed for full-length and domain constructs of protein targets. In a second step, several additional properties that can be calculated from the protein sequence were added and evaluated. Using statistical analyses, we identified a set of attributes that correlate with a protein's propensity to crystallize and implemented a Support Vector Machine (SVM) classifier based on them. We have also created applications that analyze query sequences, report optimal boundary information for them, and visualize the data. These tools are available via the web site http://bioinformatics.anl.gov/cgi-bin/tools/pdpredictor.
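As an illustration of one sequence-derived property named above, GRAVY is conventionally defined as the mean Kyte-Doolittle hydropathy value over all residues in a sequence. The Python sketch below shows that calculation under stated assumptions: the function name `gravy` and the example sequence are hypothetical, and the code is not taken from the authors' tools.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the grand average of hydropathy (GRAVY) of a protein sequence.

    Hypothetical helper for illustration: sums the Kyte-Doolittle value
    of every residue and divides by the sequence length.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence is empty")
    # Reject anything outside the 20 standard one-letter codes.
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    # Arbitrary example sequence, not drawn from the study's data set.
    example = "MKTAYIAKQR"
    print(f"GRAVY({example}) = {gravy(example):.3f}")
```

Positive values indicate a more hydrophobic sequence on average and negative values a more hydrophilic one. In a workflow like the one described, such a value would be one entry in the feature vector passed to a classifier.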
