Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis

Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized aspects of proteins, such as their functions, biophysical properties, subcellular locations, and interactions. However, studies indicate that the performance of these methods must improve further before they can effectively solve complex problems in biomedicine and biotechnology. A data representation method can be defined as an algorithm that calculates numerical feature vectors for the samples in a dataset, to be used later in quantitative modelling tasks. Data representation learning methods do this by training and then applying a model based on statistical and machine/deep learning algorithms. These methods mostly take inspiration from the data-driven language models that have yielded ground-breaking improvements in natural language processing. Recently, learned data representations have been applied to protein informatics and have shown highly promising results in extracting complex traits of proteins related to sequence-structure-function relations. In this study, we conducted a detailed investigation of protein representation learning methods, first categorizing and explaining each approach, and then conducting benchmark analyses on (i) inferring semantic similarities between proteins, (ii) predicting ontology-based protein functions, and (iii) classifying drug target protein families. We examined the advantages and disadvantages of each representation approach in light of the benchmark results. Finally, we discuss current challenges and suggest future directions. We believe the conclusions of this study will help researchers apply machine/deep learning-based representation techniques to protein data for various types of predictive tasks. Furthermore, we hope it will demonstrate the potential of machine learning-based data representations for protein science and inspire the development of novel methods and tools for biomedicine and biotechnology.
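
To make the definition of a data representation method concrete, the following is a minimal sketch that maps a protein sequence to a fixed-length numerical feature vector and compares two such vectors. It uses plain amino acid composition, a classical hand-crafted representation chosen here only for illustration; it is not one of the learned representations evaluated in this study, and the function names and example sequences are hypothetical.

```python
from collections import Counter
import math

# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Return the 20-dimensional amino acid composition of a sequence.

    Each entry is the fraction of standard residues in the sequence that
    are the corresponding amino acid. Non-standard symbols are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical toy sequences, not real proteins.
    vec_a = composition_vector("MKTAYIAKQR")
    vec_b = composition_vector("MKTAYLAKQR")
    print(len(vec_a), cosine_similarity(vec_a, vec_b))
```

A learned representation would replace `composition_vector` with the output of a trained model, while the downstream step (comparing or classifying the resulting vectors) keeps the same shape.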
