OPTIMAL SMOOTHING IN KERNEL DISCRIMINANT ANALYSIS

One well-known use of kernel density estimates is in nonparametric discriminant analysis, and its popularity is evident from its implementation in some commonly used statistical software packages (e.g., SAS). In this paper, we critically investigate the influence of the bandwidth on the behavior of the average misclassification probability of a classifier based on kernel density estimates. In the course of this investigation, we observed some counterintuitive results. For instance, the use of bandwidths that minimize the mean integrated squared error (MISE) of the kernel estimates of the population densities may lead to rather poor average misclassification rates. Further, the best choice of smoothing parameters in classification problems depends not only on the underlying true densities and the sample sizes but also on the prior probabilities. In particular, when the prior probabilities are all equal, the behavior of the average misclassification probability turns out to be quite interesting when both the sample sizes and the bandwidths are large. Our theoretical analysis provides some new insights into the problem of smoothing in nonparametric discriminant analysis. We also observe that popular cross-validation techniques (e.g., leave-one-out or V-fold) may not be very effective for selecting the bandwidth in practice. As a by-product of our investigation, we present a method for choosing appropriate values of the bandwidths when kernel density estimates are fitted to the training sample in a classification problem. The performance of the proposed method is demonstrated using simulation experiments as well as analysis of benchmark data sets, and its asymptotic properties are studied under some regularity conditions.
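To make the classifier under discussion concrete, the following is a minimal sketch (not the authors' procedure) of a kernel-density-based Bayes plug-in rule in one dimension: each class density is estimated with a Gaussian kernel, and a test point is assigned to the class maximizing prior times estimated density. The setup of two normal populations with equal priors, the helper names kde and kde_classify, and the bandwidth grid are all purely illustrative assumptions; the sketch simply lets one watch the test misclassification rate change as the bandwidth moves away from a MISE-oriented rule-of-thumb value such as Silverman's.

```python
import numpy as np

rng = np.random.default_rng(0)

def kde(train, x, h):
    """Gaussian kernel density estimate of a 1-D sample, evaluated at points x."""
    u = (x[:, None] - train[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))

def kde_classify(x, samples, priors, h):
    """Bayes plug-in rule: assign x to the class maximizing prior * estimated density."""
    scores = np.stack([p * kde(s, x, h) for s, p in zip(samples, priors)])
    return scores.argmax(axis=0)

# Illustrative setup (not from the paper): two univariate normal populations.
n = 100
samples = [rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n)]
priors = [0.5, 0.5]  # equal priors, the case highlighted in the abstract
x_test = np.concatenate([rng.normal(0.0, 1.0, 1000), rng.normal(2.0, 1.0, 1000)])
y_test = np.repeat([0, 1], 1000)

# Silverman's rule-of-thumb bandwidth, which targets the MISE of the
# density estimate rather than the misclassification rate.
h_mise = 1.06 * samples[0].std() * n ** (-1 / 5)

for h in [0.05, h_mise, 1.0, 5.0]:
    err = (kde_classify(x_test, samples, priors, h) != y_test).mean()
    print(f"h = {h:.2f}: test misclassification rate = {err:.3f}")
```

Running such a sketch typically shows the error is fairly flat over a range of moderate bandwidths and degrades at the extremes; with unequal priors, the bandwidth minimizing the error shifts away from the MISE-optimal value, which is the phenomenon the paper analyzes.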
