Comparison of Machine Learning Algorithms and Oversampling Techniques for Urinary Toxicity Prediction After Prostate Cancer Radiotherapy

Prostate cancer radiotherapy unavoidably irradiates not only the target volume but also the healthy organs at risk neighboring the prostate, which can cause toxicity-related side effects. In the case of urinary toxicity, these side effects may be associated with a variety of dosimetric, clinical and genetic factors, making their prediction particularly challenging. Given the inconsistency of available data concerning radiation-induced toxicity, robust models with high predictive performance are needed to enable tailored treatments. Machine learning techniques are appealing in this context, but there is no consensus on the best algorithms to use. This work compares several machine learning classifiers, combined with different minority-class oversampling techniques, for predicting urinary toxicity following prostate cancer radiotherapy from dosimetric and clinical data. The classifiers were evaluated on the original dataset and on datasets rebalanced with four different synthetic oversampling techniques, using the area under the ROC curve (AUC) and the F-measure as performance metrics. Results suggest that, regardless of the technique, oversampling always increases the prediction performance of the models (p=0.004). Overall, oversampling with the Synthetic Minority Oversampling Technique (SMOTE) followed by the Edited Nearest Neighbour (ENN) algorithm, combined with a Regularized Discriminant Analysis (RDA) classifier, provides the best performance (AUC=0.71).
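The abstract does not describe the implementation, so the following is only a minimal pure-Python sketch of the general idea behind SMOTE followed by ENN, together with the two metrics named above. It assumes binary labels with 1 as the minority (toxicity) class, Euclidean distance, and at least two minority samples; the function names, the values of k and the toy data are illustrative assumptions, not the authors' code or settings.

```python
import math
import random

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _neighbours(i, X, candidates, k):
    """Indices of the k candidates closest to X[i], excluding i itself."""
    others = [j for j in candidates if j != i]
    others.sort(key=lambda j: _dist(X[i], X[j]))
    return others[:k]

def smote(X, y, minority=1, k=5, seed=0):
    """Interpolate new minority samples until both classes have equal size."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = (len(y) - len(min_idx)) - len(min_idx)
    X_out, y_out = [list(row) for row in X], list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)                        # random minority sample
        j = rng.choice(_neighbours(i, X, min_idx, k))  # one of its minority neighbours
        gap = rng.random()                             # position along the segment i -> j
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(minority)
    return X_out, y_out

def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbours."""
    all_idx = range(len(y))
    keep = []
    for i in all_idx:
        nn = _neighbours(i, X, all_idx, k)
        agree = sum(1 for j in nn if y[j] == y[i])
        if 2 * agree > len(nn):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f_measure(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy imbalanced data: six majority samples near the origin, two minority samples.
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
         [0.2, 0.4], [0.4, 0.2], [2.0, 2.0], [2.1, 2.2]]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = smote(X_toy, y_toy, k=5, seed=0)          # oversample the minority class
X_clean, y_clean = edited_nearest_neighbours(X_bal, y_bal, k=3)  # then remove noisy samples
print(len(y_toy), len(y_bal), len(y_clean))
```

In a real evaluation the resampling would be applied to the training folds only, so that synthetic samples never leak into the data used to compute the AUC and F-measure.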
