A Deep Semi-Supervised Framework for Accurate Modelling of Orphan Sequences

Accurately modelling a single orphan protein sequence in the absence of homology information has remained a challenge for several decades. Although they are less accurate than their homology-based counterparts, single-sequence bioinformatic methods are not constrained by the requirement for evolutionary information and so have a wide range of applications. By taking a bioinformatics approach to semi-supervised machine learning, we develop Profile Augmentation of Single Sequences (PASS), a simple but powerful framework for developing accurate single-sequence methods. To demonstrate the effectiveness of PASS, we apply it to the mature field of secondary structure prediction. In doing so we develop S4PRED, the successor to the open-source PSIPRED-Single method, which achieves an unprecedented Q3 score of 75.3% on the standard CB513 test set. PASS provides a blueprint for the development of a new generation of predictive methods, advancing our ability to model individual protein sequences.
