Flexible Data Trimming Improves Performance of Global Machine Learning Methods in Omics-Based Personalized Oncology

(1) Background: Machine learning (ML) methods are rarely used for omics-based prescription of cancer drugs, owing to a shortage of case histories in which clinical outcome is supplemented by high-throughput molecular data. This shortage makes most ML methods vulnerable to overtraining. Recently, we proposed a hybrid global-local approach to ML, termed the floating window projective separator (FloWPS), that avoids extrapolation in the feature space. Its core property is data trimming, i.e., sample-specific removal of irrelevant features. (2) Methods: Here, we applied FloWPS to seven popular ML methods: linear SVM, k nearest neighbors (kNN), random forest (RF), Tikhonov (ridge) regression (RR), binomial naïve Bayes (BNB), adaptive boosting (ADA) and multi-layer perceptron (MLP). (3) Results: We performed computational experiments on 21 high-throughput gene expression datasets (41–235 samples per dataset), representing a total of 1778 cancer patients with known responses to chemotherapy treatments. FloWPS substantially improved classifier quality for all global ML methods (SVM, RF, BNB, ADA, MLP): the area under the receiver operating characteristic curve (ROC AUC) of the treatment response classifiers increased from the 0.61–0.88 range to 0.70–0.94. We tested FloWPS-empowered methods for overtraining by comparing the importance of individual features across the different ML methods on the same model datasets. (4) Conclusions: We showed that FloWPS increases the correlation of feature importance between the different ML methods, which indicates its robustness to overtraining. For all the datasets tested, the best performance of FloWPS data trimming was observed for the BNB method, a finding that can be valuable for building future ML classifiers in personalized oncology.
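The idea of sample-specific feature removal can be illustrated with a minimal Python sketch. The abstract does not specify the trimming rule, so the criterion used here is an assumption made purely for illustration: a feature is kept for a new sample only if at least `m` training samples lie at or below, and at least `m` at or above, the new sample's value, so that a downstream classifier would interpolate rather than extrapolate along that feature. The function name `trimmed_feature_mask`, the parameter `m`, and the toy data are all hypothetical.

```python
import numpy as np


def trimmed_feature_mask(X_train, x_new, m=1):
    """Return a boolean mask of the features kept for one new sample.

    Assumed rule (not taken from the abstract): keep a feature only if at
    least `m` training values are <= the new sample's value and at least
    `m` training values are >= it.
    """
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Per-feature counts of training samples on each side of the new value.
    n_at_or_below = (X_train <= x_new).sum(axis=0)
    n_at_or_above = (X_train >= x_new).sum(axis=0)
    return (n_at_or_below >= m) & (n_at_or_above >= m)


# Toy data: 3 training samples, 3 features.
X_train = np.array([[1.0, 5.0, 0.2],
                    [2.0, 6.0, 0.4],
                    [3.0, 7.0, 0.6]])
# The new sample lies inside the training range only for the first feature.
x_new = np.array([2.5, 9.0, 0.1])

mask = trimmed_feature_mask(X_train, x_new, m=1)
print(mask)                # features kept for this particular sample
print(X_train[:, mask])    # trimmed training matrix for this sample
```

Because the mask is computed per sample, each patient may end up with a different feature subset; any of the listed classifiers could then be trained on `X_train[:, mask]` for that patient alone. Raising `m` makes the trimming stricter.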
