STS-NLSP: A Network-Based Label Space Partition Method for Predicting the Specificity of Membrane Transporter Substrates Using a Hybrid Feature of Structural and Semantic Similarity

Membrane transport proteins play crucial roles in the pharmacokinetics of substrate drugs and in cancer drug resistance, and are vital to drug discovery, drug development, and anti-cancer therapeutics. However, experimental methods that profile a substrate drug against a panel of transporters to determine its specificity are labor-intensive and time-consuming. In this article, we develop an in silico multi-label classification approach to predict whether a substrate can specifically recognize one of 13 categories of drug transporters, ranging from the ATP-binding cassette to the solute carrier families, using both the structural fingerprints and the chemical ontology information of substrates. The data-driven network-based label space partition (NLSP) method was used to construct the model from a hybrid similarity-based feature that integrates 2D fingerprint similarity and semantic similarity. This method builds a predictor for each label cluster (clusters may intersect) detected by community detection algorithms and takes the union of the predicted label sets for a compound as the final prediction. NLSP belongs to the category of ensembles of multi-label classifiers in the multi-label learning field. We used the Cramér's V statistic to quantify the label correlations and depicted them in a heatmap. Jackknife tests and iterative-stratification-based cross-validation were applied to a benchmark dataset to evaluate the prediction performance of the proposed models in both a multi-label and a label-wise manner. Compared with other powerful multi-label methods, namely ML-kNN, MLTSVM, and RAkELd, our multi-label classification model NLSP-RF (random forest-based NLSP) proved to be feasible and effective, and performed satisfactorily in the task of predicting transporter-substrate specificity. The idea behind the NLSP method is intriguing, and its power remains to be explored for multi-label learning problems in bioinformatics.
The benchmark dataset, intermediate results, and Python code that fully reproduce our experiments and results are available at https://github.com/dqwei-lab/STS.
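
To make the label-correlation step more concrete, the following is a minimal sketch of how Cramér's V could be computed for every pair of binary transporter labels before plotting a heatmap. It is not taken from the repository above: the function names `cramers_v` and `label_correlation_matrix`, the toy label matrix `Y`, and its three unnamed label columns are all hypothetical, and the statistic shown is the plain (uncorrected) form, which may differ from the exact variant used in the experiments.

```python
from math import sqrt


def cramers_v(a, b):
    """Uncorrected Cramér's V between two equal-length binary (0/1) label vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label vectors must be non-empty and of equal length")
    n = len(a)

    # 2x2 contingency table: table[x][y] counts compounds with a == x and b == y.
    table = [[0, 0], [0, 0]]
    for x, y in zip(a, b):
        table[x][y] += 1

    row_totals = [sum(r) for r in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    # Assumption: if either label is constant, the association is undefined;
    # this sketch reports 0.0 in that case.
    if 0 in row_totals or 0 in col_totals:
        return 0.0

    # Pearson chi-squared statistic against the independence expectation.
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # V = sqrt(chi2 / (n * (min(rows, cols) - 1))); the last factor is 1 for a 2x2 table.
    return sqrt(chi2 / n)


def label_correlation_matrix(Y):
    """Pairwise Cramér's V for the label columns of a binary matrix Y (rows = compounds)."""
    n_labels = len(Y[0])
    columns = [[row[j] for row in Y] for j in range(n_labels)]
    return [
        [cramers_v(columns[i], columns[j]) for j in range(n_labels)]
        for i in range(n_labels)
    ]


# Hypothetical toy data: 6 compounds annotated with 3 transporter labels.
Y = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]

for row in label_correlation_matrix(Y):
    print(" ".join(f"{v:.3f}" for v in row))
```

The resulting square matrix is symmetric with values in [0, 1], so it can be passed directly to any heatmap routine. Label pairs with high values are candidates for being grouped into the same label cluster.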
