Next-Step Conditioned Deep Convolutional Neural Networks Improve Protein Secondary Structure Prediction

Recently developed deep learning techniques have significantly improved the accuracy of various speech and image recognition systems. In this paper we show how to adapt some of these techniques to create a novel chained convolutional architecture with next-step conditioning for improving performance on protein sequence prediction problems. We explore its value by demonstrating its ability to improve performance on eight-class secondary structure prediction. We first establish a state-of-the-art baseline by adapting recent advances in convolutional neural networks which were developed for vision tasks. This model achieves 70.0% per amino acid accuracy on the CB513 benchmark dataset without use of standard performance-boosting techniques such as ensembling or multitask learning. We then improve upon this state-of-the-art result using a novel chained prediction approach which frames the secondary structure prediction as a next-step prediction problem. This sequential model achieves 70.3% Q8 accuracy on CB513 with a single model; an ensemble of these models produces 71.4% Q8 accuracy on the same test set, improving upon the previous overall state of the art for the eight-class secondary structure problem. Our models are implemented using TensorFlow, an open-source machine learning software library available at TensorFlow.org; we aim to release the code for these experiments as part of the TensorFlow repository.

[1]  R. Schulz,et al.  Protein Structure Prediction , 2020, Methods in Molecular Biology.

[2]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Osmar Norberto de Souza,et al.  Protein Structure, Modelling and Applications , 2007 .

[4]  Jian Zhou,et al.  Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction , 2014, ICML.

[5]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[7]  Ying Xu,et al.  A historical perspective of template-based protein structure prediction. , 2008, Methods in molecular biology.

[8]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Zhen Li,et al.  Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks , 2016, IJCAI.

[10]  Navdeep Jaitly,et al.  Chained Predictions Using Convolutional Neural Networks , 2016, ECCV.

[11]  Yanjun Qi,et al.  MUST-CNN: A Multilayer Shift-and-Stitch Deep Convolutional Architecture for Sequence-Based Protein Structure Prediction , 2016, AAAI.

[12]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[13]  Ole Winther,et al.  Protein Secondary Structure Prediction with Long Short Term Memory Networks , 2014, ArXiv.

[14]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[15]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[16]  Jian Peng,et al.  Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields , 2015, Scientific Reports.

[17]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[18]  K. Dill,et al.  The Protein Folding Problem , 1993 .

[19]  Haruki Nakamura,et al.  Announcing the worldwide Protein Data Bank , 2003, Nature Structural Biology.

[20]  T. Sejnowski,et al.  Predicting the secondary structure of globular proteins using neural network models. , 1988, Journal of molecular biology.

[21]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[22]  D. Kihara The effect of long‐range interactions on the secondary structure formation of proteins , 2005, Protein science : a publication of the Protein Society.

[23]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[24]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[25]  C. Anfinsen Principles that govern the folding of protein chains. , 1973, Science.