Protein Fold Recognition From Sequences Using Convolutional and Recurrent Neural Networks

The identification of a protein fold type from its amino acid sequence provides important insights about the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated to the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several template-based and deep learning-based methods from the state-of-the-art. The evaluation results over the well-known LINDAHL and SCOP_TEST sets, along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, specially at the fold level. Supplementary material, source code and trained models are available at http://sigmat.ugr.es/~amelia/CNN-GRU-RF+/.

[1]  SödingJohannes Protein homology detection by HMM--HMM comparison , 2005 .

[2]  Chris H. Q. Ding,et al.  Multi-class protein fold recognition using support vector machines and neural networks , 2001, Bioinform..

[3]  Pierre Baldi,et al.  A machine learning information retrieval approach to protein fold recognition. , 2006, Bioinformatics.

[4]  Hongbo Mu,et al.  An ensemble approach to protein fold classification by integration of template‐based assignment and support vector machine classifier , 2016, Bioinform..

[5]  James G. Lyons,et al.  Advancing the Accuracy of Protein Fold Recognition by Utilizing Profiles From Hidden Markov Models , 2015, IEEE Transactions on NanoBioscience.

[6]  A V Finkelstein,et al.  The classification and origins of protein folding patterns. , 1990, Annual review of biochemistry.

[7]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[8]  Jianyi Yang,et al.  Improving taxonomy‐based protein fold recognition by using global and local features , 2011, Proteins.

[9]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[10]  Xiaozhao Fang,et al.  Protein fold recognition based on multi-view modeling , 2019, Bioinform..

[11]  Zhiyong Wang,et al.  MRFalign: Protein Homology Detection through Alignment of Markov Random Fields , 2014, PLoS Comput. Biol..

[12]  Q. Zou,et al.  Recent Progress in Machine Learning-Based Methods for Protein Fold Recognition , 2016, International journal of molecular sciences.

[13]  Bin Liu,et al.  Fold-LTR-TCP: protein fold recognition based on triadic closure principle , 2019, Briefings Bioinform..

[14]  Junjie Chen,et al.  A comprehensive review and comparison of different computational methods for protein remote homology detection , 2018, Briefings Bioinform..

[15]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[16]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[17]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[18]  R Dustin Schaeffer,et al.  Protein folds and protein folding. , 2011, Protein engineering, design & selection : PEDS.

[19]  Jie Hou,et al.  DeepSF: deep convolutional neural network for mapping protein sequences to folds , 2017, Bioinform..

[20]  E. Lindahl,et al.  Identification of related proteins on family, superfamily and fold level. , 2000, Journal of molecular biology.

[21]  Pierre Baldi,et al.  SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity , 2014, Bioinform..

[22]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[23]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[24]  Juan A Morales-Cordovilla,et al.  Protein alignment based on higher order conditional random fields for template-based modeling , 2018, PloS one.

[25]  Ole Winther,et al.  Protein Secondary Structure Prediction with Long Short Term Memory Networks , 2014, ArXiv.

[26]  Steven E. Brenner,et al.  SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures , 2013, Nucleic Acids Res..

[27]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[28]  Michael Levitt,et al.  On the universe of protein folds. , 2013, Annual review of biophysics.

[29]  Mohammed AlQuraishi,et al.  End-to-end differentiable learning of protein structure , 2018, bioRxiv.

[30]  Yaoqi Zhou,et al.  Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates , 2011, Bioinform..

[31]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[32]  Bin Liu,et al.  MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks , 2019, Briefings Bioinform..

[33]  Kuldip K. Paliwal,et al.  Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with convolutional neural networks , 2018, Bioinform..

[34]  Ole Winther,et al.  Convolutional LSTM Networks for Subcellular Localization of Proteins , 2015, AlCoB.

[35]  Chao Wang,et al.  Improving protein fold recognition by extracting fold-specific features from predicted residue–residue contacts , 2017, Bioinform..

[36]  Bin Liu,et al.  DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks , 2019, Briefings Bioinform..

[37]  Klaus Obermayer,et al.  Fast model-based protein homology detection without alignment , 2007, Bioinform..

[38]  Yaoqi Zhou,et al.  Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks , 2018, Bioinform..

[39]  Taeho Jo,et al.  Improving protein fold recognition by random forest , 2014, BMC Bioinformatics.

[40]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[41]  Jinyan Li,et al.  Prediction of 8-state protein secondary structures by a novel deep learning architecture , 2018, BMC Bioinformatics.

[42]  Yang Zhang,et al.  Detecting distant-homology protein structures by aligning deep neural-network based contact maps , 2019, PLoS Comput. Biol..

[43]  Jun Gao,et al.  ProFold: Protein Fold Classification with Additional Structural Features and a Novel Ensemble Classifier , 2016, BioMed research international.

[44]  Ying Xu,et al.  Raptor: Optimal Protein Threading by Linear Programming , 2003, J. Bioinform. Comput. Biol..

[45]  Jian Peng,et al.  Boosting Protein Threading Accuracy , 2009, RECOMB.

[46]  Y-h. Taguchi,et al.  Application of amino acid occurrence for discriminating different folding types of globular proteins , 2007, BMC Bioinformatics.

[47]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[48]  Shuigeng Zhou,et al.  A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation , 2009, Bioinform..

[49]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[50]  D. T. Jones,et al.  A new approach to protein fold recognition , 1992, Nature.

[51]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[52]  Jian Peng,et al.  A conditional neural fields model for protein threading , 2012, Bioinform..

[53]  D T Jones,et al.  A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. , 1999, Structure.

[54]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[55]  Kuo-Chen Chou,et al.  Ensemble classifier for protein fold pattern recognition , 2006, Bioinform..

[56]  Taeho Jo,et al.  Improving Protein Fold Recognition by Deep Learning Networks , 2015, Scientific Reports.

[57]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..