Improving protein secondary structure prediction by deep language models and transformer networks

Protein secondary structure prediction is useful for many applications. It can be considered a language translation problem, i.e., translating a sequence of 20 different amino acids into a sequence of secondary structure symbols (e.g., alpha helix, beta strand, and coil). Here, we develop a novel protein secondary structure predictor called TransPross, based on the transformer network and attention mechanism widely used in natural language processing, to extract evolutionary information directly from the protein language (i.e., the raw multiple sequence alignment (MSA) of a protein) and predict the secondary structure. The method differs from traditional methods, which first generate an MSA and then calculate expert-curated statistical profiles from the MSA as input. The attention mechanism used by TransPross can effectively capture long-range residue-residue interactions in protein sequences to predict secondary structures. Benchmarked on several datasets, TransPross outperforms the state-of-the-art methods. Moreover, our experiment shows that the prediction accuracy of TransPross correlates positively with the depth of the MSAs, and that it achieves an average prediction accuracy (i.e., Q3 score) above 80% for hard targets with few homologous sequences in their MSAs. TransPross is freely available at https://github.com/BioinfoMachineLearning/TransPro
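For readers unfamiliar with the Q3 score mentioned above, the following is a minimal sketch of how three-state accuracy is commonly computed: the percentage of residues whose predicted state (helix, strand, or coil) matches the observed state. The function name `q3_score`, the single-letter labels H/E/C, and the example strings are illustrative assumptions and are not taken from the TransPross code.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose 3-state label matches.

    Both arguments are strings of per-residue labels, assumed here to be
    'H' (helix), 'E' (strand), and 'C' (coil). This is a hypothetical
    helper for illustration, not part of TransPross.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    # Count positions where the predicted label equals the observed label.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Made-up 10-residue example: one helix residue is mispredicted as coil.
observed = "CHHHHEEECC"
predicted = "CHHHCEEECC"
print(q3_score(predicted, observed))  # 9 of 10 residues match
```

In this toy case 9 of the 10 residues agree, so the score is 90%. A benchmark figure such as "above 80%" would typically be an average of per-protein or per-residue scores over a test set; which averaging the authors used is not stated in the abstract.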
