DeepCrystal: A Deep Learning Framework for Sequence-based Protein Crystallization Prediction

Protein structure determination has primarily been performed using X-ray crystallography. To overcome its high cost, high attrition rate and series of trial-and-error settings, many in-silico methods have been developed to predict the crystallization propensity of proteins from their sequences. However, the majority of these methods build predictors by extracting features from protein sequences, which is computationally expensive and can cause the feature space to explode. We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It identifies proteins that can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from the sequence. Our model is based on Convolutional Neural Networks (CNNs), which can exploit frequently occurring k-mers and sets of k-mers in protein sequences to discriminate proteins that yield diffraction-quality crystals from non-crystallizable ones. Our model outperforms previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy and MCC on three independent test sets. DeepCrystal achieves average improvements in recall of 1.4% and 12.1% over its closest competitors, Crysalis II and Crysf, respectively. In addition, DeepCrystal attains average improvements of 2.1% and 6.0% in F-score, 1.9% and 3.9% in accuracy, and 3.8% and 7.0% in MCC over Crysalis II and Crysf, respectively, on the independent test sets. The standalone source code and models are available at https://github.com/elbasir/DeepCrystal and a web server is available at https://deeplearning-protein.qcri.org.
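To make the k-mer intuition concrete, the following is a minimal sketch of how a single one-dimensional convolutional filter followed by max-pooling can act as a k-mer detector on a one-hot encoded sequence. It is not the DeepCrystal architecture: the helper names (`one_hot`, `kmer_filter`, `conv1d_max`), the example sequence and the hand-set filter weights are all assumptions made for illustration, whereas a trained CNN learns its filter weights from data.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(seq):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in seq:
        row = [0.0] * len(AMINO_ACIDS)
        if aa in index:  # unknown residues are left as all-zero rows
            row[index[aa]] = 1.0
        rows.append(row)
    return rows


def kmer_filter(kmer):
    """Hypothetical hand-set filter whose weights match one exact k-mer.

    A real CNN would learn these weights during training.
    """
    return one_hot(kmer)


def conv1d_max(encoded, filt):
    """Slide the filter along the sequence and return the max activation."""
    k = len(filt)
    best = 0.0
    for start in range(len(encoded) - k + 1):
        score = sum(
            encoded[start + i][j] * filt[i][j]
            for i in range(k)
            for j in range(len(AMINO_ACIDS))
        )
        best = max(best, score)
    return best


seq = "MKTAYIAKQR"  # illustrative sequence, not taken from the paper
hit = conv1d_max(one_hot(seq), kmer_filter("AYI"))   # 3-mer present in seq
miss = conv1d_max(one_hot(seq), kmer_filter("WWW"))  # 3-mer absent from seq
```

Here the pooled activation equals the largest number of positions at which the filter's k-mer agrees with any window of the sequence, so an exact occurrence of a 3-mer scores 3.0. Stacking many such filters of different widths is one way a network could respond to sets of k-mers rather than a single motif.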
