ActTRANS: Functional classification in active transport proteins based on transfer learning and contextual representations

MOTIVATION Primary and secondary active transport are the two types of active transport, both of which use energy to move substances. Active transport mechanisms rely on proteins to assist in transport and play essential roles in regulating the traffic of ions or small molecules across a cell membrane against the concentration gradient. In this study, the two main types of proteins involved in such transport are classified from transmembrane transport proteins. We propose a Support Vector Machine (SVM) with contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) to represent protein sequences. BERT is a deep learning language representation model developed by Google, a powerful model for transfer learning, and one of the highest-performing pre-trained models for Natural Language Processing (NLP) tasks. Transfer learning with a pre-trained BERT model is applied to extract fixed feature vectors from the hidden layers and to learn contextual relations between amino acids in the protein sequence. The contextualized word representations of proteins are thus introduced to effectively model the complex structures of amino acids in the sequence and the variations of these amino acids in context. By generating context information, we capture multiple meanings for the same amino acid to reveal the importance of specific residues in the protein sequence.

RESULTS The performance of the proposed method is evaluated using five-fold cross-validation and an independent test. The proposed method achieves accuracies of 85.44%, 88.74%, and 92.84% for Class-1, Class-2, and Class-3, respectively. Experimental results show that this approach can outperform other feature extraction methods using context information, effectively classify the two types of active transport, and improve overall performance.
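The pipeline described above (per-residue contextual vectors reduced to a fixed feature vector, then an SVM evaluated with five-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embeddings are random placeholders standing in for BERT hidden-layer outputs, mean pooling is an assumed way of obtaining a fixed-length vector, and the embedding size, kernel, and hyperparameters are hypothetical choices.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMB_DIM = 32  # hypothetical embedding size; a real BERT hidden layer is larger
rng = np.random.default_rng(0)


def pool_residue_embeddings(residue_vectors):
    """Average per-residue vectors (length x dim) into one fixed-length vector.

    Mean pooling is an assumption made for this sketch.
    """
    return np.asarray(residue_vectors).mean(axis=0)


def placeholder_embeddings(label):
    """Random stand-in for the per-residue output of a pre-trained model.

    The label-dependent shift only makes the toy data separable.
    """
    length = int(rng.integers(50, 200))  # proteins vary in length
    return rng.normal(loc=0.5 * label, scale=1.0, size=(length, EMB_DIM))


# Toy dataset: 100 sequences, two classes (e.g. primary vs. secondary transport).
labels = np.array([0, 1] * 50)
X = np.stack([pool_residue_embeddings(placeholder_embeddings(y)) for y in labels])

# SVM on the fixed feature vectors, scored with stratified five-fold CV.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, labels, cv=cv, scoring="accuracy")
print("mean five-fold accuracy:", scores.mean())
```

In practice, `placeholder_embeddings` would be replaced by a call that feeds the amino acid sequence through the pre-trained model and returns its hidden-layer vectors; the classification and evaluation steps would stay the same.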
