Many criteria can be used to evaluate the performance of supervised learning. Different criteria are appropriate in different settings, and it is not always clear which criterion to use. A further complication is that learning methods that perform well on one criterion may not perform well on others. For example, SVMs and boosting are designed to optimize accuracy, whereas neural nets typically optimize squared error or cross entropy. We conducted an empirical study using a variety of learning methods (SVMs, neural nets, k-nearest neighbor, bagged and boosted trees, and boosted stumps) to compare nine performance metrics for boolean classification: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low-dimensional manifold. The three metrics that are appropriate when predictions are interpreted as probabilities (squared error, cross entropy, and calibration) lie in one part of metric space, far away from the metrics that depend only on the relative order of the predicted values (ROC area, average precision, break-even point, and lift). Between these two groups fall the two metrics that compare predictions to a threshold: accuracy and F-score. As expected, maximum margin methods such as SVMs and boosted trees perform excellently on metrics like accuracy but poorly on probability metrics such as squared error. What was not expected was that the margin methods also have excellent performance on ordering metrics such as ROC area and average precision. We introduce a new metric, SAR, that combines Squared error, Accuracy, and ROC area into one measure. MDS and correlation analysis show that SAR is centrally located and correlates well with the other metrics, suggesting it is a good general-purpose metric to use when more specific criteria are not known.
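The abstract names SAR but not its exact formula. The sketch below is a minimal implementation under the assumption that SAR is the unweighted mean of accuracy, ROC area, and one minus the RMSE of the predicted probabilities, so that each component lies in [0, 1] and higher is better; the threshold parameter and helper name are illustrative choices, not the paper's definitions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def sar(y_true, y_prob, threshold=0.5):
    """Combined Squared error / Accuracy / ROC-area score (higher is better).

    Assumed combination: the unweighted mean of accuracy, ROC area,
    and 1 - RMSE, so the three components share a [0, 1] scale.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    acc = accuracy_score(y_true, (y_prob >= threshold).astype(int))  # threshold metric
    auc = roc_auc_score(y_true, y_prob)                              # ordering metric
    rmse = np.sqrt(np.mean((y_true - y_prob) ** 2))                  # probability metric
    return (acc + auc + (1.0 - rmse)) / 3.0

# Toy usage on hand-made predictions.
y = [0, 0, 1, 1, 1, 0]
p = [0.2, 0.4, 0.8, 0.9, 0.6, 0.1]
print(f"SAR = {sar(y, p):.3f}")  # ~0.91: good ordering, thresholding, and calibration
```

Averaging the three components gives one number that penalizes a model for failing on any of the three metric families (threshold, ordering, probability) identified by the MDS analysis.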
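To make the metric-space picture concrete, here is a hedged sketch of an MDS embedding in the spirit of the abstract: metrics that rank models similarly should land close together, and SAR should land centrally among its three components. The synthetic score matrix and the use of one minus Pearson correlation as the dissimilarity are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Rows: hypothetical trained models; columns: metric scores, all
# oriented so that higher is better (loss-like metrics such as squared
# error and cross entropy are assumed already transformed, e.g. 1 - RMSE).
names = ["acc", "fsc", "lft", "auc", "apr", "bep", "rms", "mxe", "cal"]
scores = rng.random((200, len(names)))

# SAR averages the (transformed) squared-error, accuracy, and ROC-area
# columns, so it should land centrally among them in the embedding.
sar = scores[:, [names.index(m) for m in ("rms", "acc", "auc")]].mean(axis=1)
scores = np.column_stack([scores, sar])
names.append("sar")

# Dissimilarity between metrics: one minus the correlation of their
# scores across models; metrics that agree get small distances.
dissim = 1.0 - np.corrcoef(scores.T)

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")
```

With real test-set scores in place of the random matrix, the embedded points would reproduce the structure the abstract describes: probability, ordering, and threshold metrics clustering in separate regions, with SAR between them.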