Capturing Protein Domain Structure and Function Using Self-Supervision on Domain Architectures

Predicting biological properties of unseen proteins is shown to be improved by the use of protein sequence embeddings. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid, in order to measure the quality of each unique learned embedding vector separately. Therefore, current sequence embedding cannot be intrinsically evaluated on the degree of their captured biological information in a quantitative manner. We address this drawback by our approach, dom2vec, by learning vector representation for protein domains and not for each amino acid base, as biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biology knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain, which are its structure, enzymatic, and molecular function. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment—therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the dom2vec applicability on protein prediction tasks, by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.

[1]  Rolf Apweiler,et al.  IntEnz, the integrated relational enzyme database , 2004, Nucleic Acids Res..

[2]  Eneko Agirre,et al.  A reproducible survey on word embeddings and ontology-based methods for word similarity: Linear combinations outperform the state of the art , 2019, Eng. Appl. Artif. Intell..

[3]  Michael Levitt,et al.  The language of the protein universe. , 2015, Current opinion in genetics & development.

[4]  Steven E. Brenner,et al.  SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures , 2013, Nucleic Acids Res..

[5]  David J. Barlow,et al.  Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions , 2016, PeerJ Comput. Sci..

[6]  Satoshi Matsuoka,et al.  Word Embeddings, Analogies, and Machine Learning: Beyond king - man + woman = queen , 2016, COLING.

[7]  Zachary Wu,et al.  Learned protein embeddings for machine learning , 2018, Bioinformatics.

[8]  S. Brunak,et al.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. , 2000, Journal of molecular biology.

[9]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[10]  K. Chou,et al.  Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location* , 2002, The Journal of Biological Chemistry.

[11]  Malay Kumar Basu,et al.  Grammar of protein domain architectures , 2019, Proceedings of the National Academy of Sciences.

[12]  E. Sonnhammer,et al.  Evolution of protein domain architectures. , 2012, Methods in molecular biology.

[13]  Andrew D. Moore,et al.  Arrangements in the modular evolution of proteins. , 2008, Trends in biochemical sciences.

[14]  Erik L. L. Sonnhammer,et al.  Predicting protein function from domain content , 2008, Bioinform..

[15]  Giuseppe Attardi,et al.  Detecting the scope of negations in clinical notes , 2015 .

[16]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[17]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[18]  George M. Church,et al.  Unified rational protein engineering with sequence-based deep representation learning , 2019, Nature Methods.

[19]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[20]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[21]  Quoc V. Le,et al.  Addressing the Rare Word Problem in Neural Machine Translation , 2014, ACL.

[22]  Lihua Li,et al.  DEEPre: sequence-based enzyme EC number prediction by deep learning , 2017, Bioinform..

[23]  Sean R. Eddy,et al.  Pfam: multiple sequence alignments and HMM-profiles of protein domains , 1998, Nucleic Acids Res..

[24]  C. Orengo,et al.  Protein function annotation by homology-based inference , 2009, Genome Biology.

[25]  Erich Bornberg-Bauer,et al.  Rapid similarity search of proteins using alignments of domain arrangements , 2014, Bioinform..

[26]  Matthew Fraser,et al.  InterProScan 5: genome-scale protein function classification , 2014, Bioinform..

[27]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[28]  Alice C McHardy,et al.  Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX) , 2018, Scientific Reports.

[29]  Burkhard Rost,et al.  Modeling aspects of the language of life through transfer-learning protein sequences , 2019, BMC Bioinformatics.

[30]  Thorsten Joachims,et al.  Evaluation methods for unsupervised word embeddings , 2015, EMNLP.

[31]  Silvio C. E. Tosatto,et al.  InterPro in 2019: improving coverage, classification and access to protein sequence annotations , 2018, Nucleic Acids Res..

[32]  Daniel W. A. Buchan,et al.  Learning a functional grammar of protein domains using natural language word embedding techniques , 2019, Proteins.

[33]  Bo Yu,et al.  CDD/SPARCLE: functional classification of proteins via subfamily domain architectures , 2016, Nucleic Acids Res..

[34]  Karl Pearson F.R.S. LIII. On lines and planes of closest fit to systems of points in space , 1901 .

[35]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[36]  Maria Jesus Martin,et al.  UniProt-DAAC: domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB , 2016, Bioinform..

[37]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.