Bolstered error estimation

Abstract

We propose a general method for error estimation that displays low variance and, generally, low bias as well. The method is based on "bolstering" the original empirical distribution of the data. It has a direct geometric interpretation and can be easily applied to any classification rule and any number of classes. It can be used to improve the performance of any error-counting estimation method, such as resubstitution and all cross-validation estimators, particularly in small-sample settings. We point out similarities between our method and a previously proposed technique known as smoothed error estimation. In some important cases, such as a linear classification rule with a Gaussian bolstering kernel, the integrals in the bolstered error estimate can be computed exactly. In the general case, the bolstered error estimate may be computed by Monte-Carlo sampling; however, our experiments show that only a very small number of Monte-Carlo samples is needed. The result is a fast error estimator, in contrast to other resampling techniques such as the bootstrap. We provide an extensive simulation study comparing the proposed method with resubstitution, cross-validation, and bootstrap error estimation, for three popular classification rules (linear discriminant analysis, k-nearest-neighbor, and decision trees), using several sample sizes, from small to moderate. The results indicate that the proposed method vastly improves on resubstitution and cross-validation, especially for small samples, in terms of bias and variance. In that respect, it is competitive with, and on many occasions superior to, bootstrap error estimation, while being tens to hundreds of times faster.
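The Monte-Carlo variant described above can be sketched in a few lines: each training point is replaced by a bolstering kernel centered at that point, samples are drawn from each kernel, and the error is the average fraction of samples the classifier assigns the wrong label. The sketch below is a minimal illustration, not the paper's implementation; the isotropic Gaussian kernel, the fixed bandwidth `sigma`, and the toy linear classifier are all assumptions for demonstration (the paper derives the kernel variance from the data).

```python
import numpy as np

def bolstered_resubstitution(X, y, classify, sigma=0.5, n_mc=25, seed=None):
    """Monte-Carlo bolstered resubstitution error estimate.

    For each training point x_i, draw n_mc samples from an isotropic
    Gaussian bolstering kernel N(x_i, sigma^2 I) and count the fraction
    that `classify` assigns a label different from y_i.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    total_error = 0.0
    for xi, yi in zip(X, y):
        samples = rng.normal(loc=xi, scale=sigma, size=(n_mc, d))
        preds = classify(samples)
        total_error += np.mean(preds != yi)
    return total_error / n

# Toy linear classifier: label 1 iff the first coordinate is positive.
def classify(X):
    return (X[:, 0] > 0).astype(int)

X = np.array([[-1.0, 0.0], [-2.0, 1.0], [1.5, -0.5], [2.0, 0.3]])
y = np.array([0, 0, 1, 1])
est = bolstered_resubstitution(X, y, classify, sigma=0.5, n_mc=25, seed=0)
```

With `sigma = 0`, every sample coincides with its training point and the estimate reduces to plain resubstitution; increasing `sigma` spreads probability mass across the decision boundary, which is what counteracts the optimistic bias of resubstitution.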
We provide a companion web site, which contains: (1) the complete set of tables and plots regarding the simulation study, and (2) C source code used to implement the bolstered error estimators proposed in this paper, as part of a larger library for classification and error estimation, with full documentation and examples. The companion web site can be accessed at the URL http://ee.tamu.edu/~edward/bolster .
