Embeddings from deep learning transfer GO annotations beyond homology

Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods typically bridge this sequence-annotation gap in one of two ways: homology-based annotation transfer, which identifies sequence-similar proteins with known function, or prediction methods that use evolutionary information. Here, we propose predicting GO terms through annotation transfer based on the proximity of proteins in SeqVec embedding space rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) and transfer the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37±2%, 50±3%, and 57±2% for the biological process (BPO), molecular function (MFO), and cellular component (CCO) ontologies, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with <20% pairwise sequence identity to the query, performance drops (Fmax BPO 33±2%, MFO 43±3%, CCO 53±2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
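
The idea of transferring annotations by proximity in embedding space can be illustrated with a minimal sketch. It assumes each protein is already represented by a fixed-length vector; the toy three-dimensional vectors, the lookup entries, the function names, and the distance-to-score conversion `1 / (1 + d)` are all illustrative assumptions, not the embeddings or scoring described for the method itself.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotations(query, lookup, k=1):
    """Transfer GO terms from the k lookup proteins closest to the query.

    Returns {GO term: score}. The score 1 / (1 + distance) is an
    assumed, illustrative way to turn a distance into a confidence.
    """
    # Rank annotated proteins by distance to the query embedding.
    ranked = sorted(lookup, key=lambda e: euclidean(query, e["embedding"]))
    scores = {}
    for entry in ranked[:k]:
        similarity = 1.0 / (1.0 + euclidean(query, entry["embedding"]))
        for term in entry["go_terms"]:
            # Keep the best score if several neighbours share a term.
            scores[term] = max(scores.get(term, 0.0), similarity)
    return scores


# Toy lookup set: hypothetical proteins with made-up 3-d "embeddings".
lookup = [
    {"id": "P1", "embedding": [0.0, 0.0, 0.0], "go_terms": {"GO:0005524"}},
    {"id": "P2", "embedding": [1.0, 0.0, 0.0], "go_terms": {"GO:0003677"}},
    {"id": "P3", "embedding": [0.0, 3.0, 4.0],
     "go_terms": {"GO:0005524", "GO:0016301"}},
]
query = [0.1, 0.0, 0.0]

print(transfer_annotations(query, lookup, k=1))
print(transfer_annotations(query, lookup, k=2))
```

With `k=1` the query inherits only the terms of its single nearest neighbour; raising `k` pools terms from more distant proteins at lower scores. Unlike homology-based transfer, no sequence alignment is involved: the ranking depends only on distances between vectors.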
