OncoRTT: Predicting novel oncology-related therapeutic targets using BERT embeddings and omics features

Late-stage drug development failures are usually a consequence of ineffective targets. Proper target identification is therefore needed, and computational approaches may make it possible: effective targets have disease-relevant biological functions, and omics data reveal the proteins involved in these functions. In addition, properties that favor drug-target binding can be deduced from the protein’s amino acid sequence. In this work, we developed OncoRTT, a deep learning (DL)-based method for predicting novel therapeutic targets. OncoRTT is designed to reduce suboptimal target selection by identifying novel targets that share the features of known effective targets. First, we created the “OncologyTT” datasets, which include genes/proteins associated with ten prevalent cancer types. Then, we generated three sets of features for all genes: omics features, BERT embeddings of the proteins’ amino acid sequences, and the integration of the two. We trained and tested the DL classifiers on each feature set separately. The models achieved high prediction performance in terms of area under the curve (AUC): the AUC was greater than 0.88 for all cancer types, with a maximum of 0.95 for leukemia. OncoRTT also outperformed the state-of-the-art method on that method's own data in five of the seven cancer types assessed by both methods. Furthermore, OncoRTT predicted novel therapeutic targets from new test data related to the seven cancer types. We corroborated these predictions with additional validation evidence from the Open Targets Platform and with a case study of the top-10 predicted therapeutic targets for lung cancer.
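
The feature integration and AUC evaluation described above can be illustrated with a minimal sketch. It is not the OncoRTT implementation: the function names `integrate_features` and `roc_auc`, the array shapes, and the random placeholder data are all assumptions made for illustration. Integration is assumed here to be a simple column-wise concatenation, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
import numpy as np


def integrate_features(omics, embeddings):
    """Concatenate per-gene omics features with sequence embeddings.

    Assumption: "integrated features" means column-wise concatenation,
    with row i of both arrays describing the same gene/protein.
    """
    if omics.shape[0] != embeddings.shape[0]:
        raise ValueError("both feature sets must describe the same genes")
    return np.hstack([omics, embeddings])


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (pos.size * neg.size))


# Placeholder data: 6 genes, 4 omics features, 8 embedding dimensions.
rng = np.random.default_rng(0)
omics = rng.normal(size=(6, 4))
embeddings = rng.normal(size=(6, 8))
integrated = integrate_features(omics, embeddings)  # shape (6, 12)

# Hypothetical classifier scores for four genes (1 = known target).
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the example call, three of the four positive-negative pairs are ranked correctly, so `example_auc` is 0.75. Any classifier that outputs a score per gene can be evaluated this way, on each of the three feature sets in turn.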
