Exact correlation between actual and estimated errors in discrete classification

Discrete classification problems are important in pattern recognition applications. The most commonly used discrete classification rule is the discrete histogram rule. In this letter, we provide exact expressions for the correlation coefficient between the actual error and each of the resubstitution and leave-one-out cross-validation error estimators for the discrete histogram rule. We show with an example that correlations between actual and estimated errors are generally poor, and that leave-one-out cross-validation can in fact display negative correlation when sample sizes are small and classifier complexity is large. We observe that correlation decreases with increasing classifier complexity, and that increasing the sample size does not necessarily increase the correlation. The exact expressions given here can be computed reasonably fast for a given sample size, dimensionality, and set of model parameters. This is useful because, as also illustrated in this letter, Monte Carlo approximations of the correlation coefficient are generally poor, even with a large number of simulated data sets.
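To make the quantities above concrete, the following is a minimal Monte Carlo sketch, not the exact expressions derived in the letter. It assumes a two-class model with `b` bins, class-0 prior `c0`, class-conditional bin probabilities `p` and `q`, mixture sampling of the `n` training points, and ties in a bin broken in favor of class 0. All function and parameter names are hypothetical and chosen for illustration only.

```python
import numpy as np

def histogram_rule_errors(U, V, c0, p, q):
    """Actual, resubstitution, and leave-one-out errors of the discrete
    histogram rule, given per-bin counts U (class 0) and V (class 1).
    Assumption: a bin is labeled 1 only if V > U (ties go to class 0)."""
    U, V = np.asarray(U), np.asarray(V)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    n = U.sum(axis=-1) + V.sum(axis=-1)
    predict_1 = V > U
    # Actual error: probability mass of the class that each bin mislabels.
    actual = np.where(predict_1, c0 * p, (1 - c0) * q).sum(axis=-1)
    # Resubstitution: the minority count in each bin is misclassified.
    resub = np.minimum(U, V).sum(axis=-1) / n
    # Leave-one-out: remove one point, re-apply the rule to its bin.
    loo = (U * (V >= U) + V * (U >= V - 1)).sum(axis=-1) / n
    return actual, resub, loo

def monte_carlo_correlation(n, c0, p, q, n_rep=10000, seed=0):
    """Monte Carlo estimate of corr(actual, resub) and corr(actual, loo)."""
    rng = np.random.default_rng(seed)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    b = p.size
    # Joint probabilities over (class, bin): first b cells are class 0.
    cells = np.concatenate([c0 * p, (1 - c0) * q])
    counts = rng.multinomial(n, cells, size=n_rep)  # shape (n_rep, 2b)
    actual, resub, loo = histogram_rule_errors(
        counts[:, :b], counts[:, b:], c0, p, q)
    return np.corrcoef(actual, resub)[0, 1], np.corrcoef(actual, loo)[0, 1]

if __name__ == "__main__":
    p = [0.4, 0.3, 0.2, 0.1]
    q = [0.1, 0.2, 0.3, 0.4]
    for seed in range(3):
        print(monte_carlo_correlation(20, 0.5, p, q, n_rep=2000, seed=seed))
```

Running the script with several seeds gives a rough sense of how much such simulated estimates vary from run to run, which is the motivation the abstract gives for preferring exact expressions.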
