A Deep Semi-Supervised Framework for Accurate Modelling of Orphan Sequences

Predicting the secondary structure of a single protein sequence in the absence of homology information has remained a challenge for several decades. Although not as performant as their homology-based counterparts, single-sequence methods are not constrained by the requirement of evolutionary information. More accurate single-sequence approaches have the potential to improve structural modelling across the vast majority of sequence space, especially in areas of great scientific interest like viral proteins, the "dark proteome", and de novo protein design. Here we introduce S4PRED, the successor to the open-source PSIPRED-Single method, which utilizes semi-supervised learning to achieve a Q3 score of 75.3% on the standard CB513 test set, taking only single sequences as input. Not only does this result represent a leap in performance for single-sequence methods, it also provides a blueprint for the development of future tools, beyond waiting for larger structural datasets and more powerful neural networks.
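For readers unfamiliar with the metric, Q3 is conventionally the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. The abstract does not define it, so the following is a minimal sketch under that conventional definition, not the S4PRED evaluation code. The function name `q3_score` and the example strings are hypothetical, and the labels are assumed to be already reduced to the three states.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return per-residue three-state accuracy as a percentage.

    Assumes both strings use one character per residue drawn from
    H (helix), E (strand), C (coil). Hypothetical helper for illustration.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count positions where the predicted state matches the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example: one of ten residues is mispredicted (H predicted as C).
observed = "CCHHHHEEEC"
predicted = "CCHHHCEEEC"
print(q3_score(predicted, observed))
```

A benchmark figure such as the one reported on CB513 would typically aggregate this quantity over every residue or every sequence in the test set; which aggregation is used is not stated here.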
