Performance evaluation is an essential part of the machine learning process, but it is also a complex task that must be conducted carefully if applications of machine learning to radiation oncology or other domains are to be reliable. This chapter introduces the problem and surveys the most commonly used techniques for addressing it, focusing on the three main subtasks of evaluation: measuring performance, resampling the data, and assessing the statistical significance of the results. For the first subtask, the chapter covers confusion matrix-based measures (accuracy, precision, recall or sensitivity, and false alarm rate) together with receiver operating characteristic (ROC) analysis. For the second, it presents error estimation and resampling techniques from the cross-validation family as well as bootstrapping. For the third, it describes several nonparametric statistical tests, including McNemar's test, Wilcoxon's signed-rank test, and Friedman's test. The chapter concludes with a discussion of the limitations of the evaluation process.
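As a concrete illustration of the first and third subtasks, the sketch below computes the confusion-matrix measures named above and McNemar's test statistic (with continuity correction) for comparing two classifiers on the same test set. The example labels and function names are hypothetical illustrations, not taken from the chapter.

```python
def binary_confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) counts for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

def metrics(y_true, y_pred):
    """Confusion matrix-based measures discussed in the chapter."""
    tp, fp, tn, fn = binary_confusion(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),            # also called sensitivity
        "false_alarm_rate": fp / (fp + tn),  # i.e., 1 - specificity
    }

def mcnemar_statistic(correct_a, correct_b):
    """Chi-squared statistic with continuity correction, computed from the
    disagreement counts of two classifiers on the same test examples."""
    b = sum(a and not c for a, c in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(not a and c for a, c in zip(correct_a, correct_b))  # A wrong, B right
    return (abs(b - c) - 1) ** 2 / (b + c)

# Toy example (hypothetical labels and predictions)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
m = metrics(y_true, y_pred)
print(m["accuracy"], m["recall"])  # 0.7 0.75
```

Under the null hypothesis that the two classifiers have the same error rate, the McNemar statistic is approximately chi-squared distributed with one degree of freedom, which is what makes it suitable for the significance-assessment subtask.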