Classification-algorithm evaluation: five performance measures based on confusion matrices.

Objective. The objective of this paper is to introduce, explain, and extend methods for comparing the performance of classification algorithms using error tallies obtained on properly sized, populated, and labeled data sets.Methods. Two distinct contexts of classification are defined, involving “objects-by-inspection” and “objects-by-segmentation.” In the former context, the total number of objects to be classified is unambiguously and self-evidently defined. In the latter, there is troublesome ambiguity. All five of the measures of performance here considered are based on confusion matrices, tables of counts revealing the extent of an algorithm's “confusion” regarding the true classifications. A proper measure of classification-algorithm performancemust meet four requirements. A proper measureshould obey six additional constraints.Results. Four traditional measures of performance are critiqued in terms of the requirements and constraints. Each measure meets the requirements, but fails to obey at least one of the constraints. A nontraditional measure of algorithm performance, the normalized mutual information (NMI), is therefore introduced. Based on the NMI, methods for comparing algorithm performance using confusion matrices are devised.Conclusions. The five performance measures lead to similar inferences when comparing a trio of QRS-detection algorithms using a large data set. The modified NMI is preferred, however, because it obeys each of the constraints and is the most conservative measure of performance.