TooT-T: discrimination of transport proteins from non-transport proteins

Membrane transport proteins (transporters) play an essential role in every living cell by moving hydrophilic molecules across hydrophobic membranes. Although the sequences of many membrane proteins are known, their structures and functions are still not well characterized, owing to the immense experimental effort required. There is therefore a need for advanced computational techniques that use sequence information alone to distinguish membrane transporter proteins from other proteins; such predictions can direct new experiments and give a hint about the function of a protein. This work proposes TooT-T, an ensemble classifier trained to optimally combine the predictions of homology annotation transfer and machine-learning methods into a final prediction. Experimental results obtained by cross-validation and independent testing show that combining the two approaches is more beneficial than employing either one alone. The proposed model outperforms all of the state-of-the-art methods that rely on the protein sequence alone with respect to accuracy and Matthews correlation coefficient (MCC). TooT-T achieved an overall accuracy of 90.07% and 92.22% and an MCC of 0.80 and 0.82 on the training and independent datasets, respectively.
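As a reminder of how the two reported metrics are defined for a binary transporter versus non-transporter task, the following minimal sketch computes accuracy and MCC from confusion-matrix counts using their standard definitions. The function names and the toy labels are illustrative assumptions; they are not taken from the TooT-T implementation or its datasets.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = transporter, 0 = non-transporter)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, invented purely for illustration (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
```

Unlike accuracy, MCC uses all four confusion-matrix cells, so it stays informative when the two classes are of unequal size.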
