Classification Using Kernel Density Estimates

Kernel density estimates are widely used in discriminant analysis by scientists and engineers working in statistical pattern recognition. Using a kernel density estimate requires choosing the scale of smoothing, namely the bandwidth parameter. The bandwidth that is optimal for the mean integrated squared error of a class density estimator need not be good for discriminant analysis, where the primary goal is to minimize misclassification rates. Cross-validation-based bandwidth selectors, which aim to minimize estimated misclassification rates, can in turn become computationally prohibitive when there are several competing populations. Moreover, such methods usually allow only one bandwidth for each population density estimate, whereas in a classification problem the optimal bandwidth for a class density estimate can vary substantially depending on the competing class densities and their prior probabilities.

In a multiclass problem, it is therefore more meaningful to use different bandwidths for a class density depending on which competing class density it is compared against. A good choice of bandwidth should also depend on the specific observation to be classified. Consequently, instead of concentrating on a single optimal bandwidth for each population density estimate, it is more useful in practice to examine the results over a range of smoothing scales.

This article presents such a multiscale approach, along with a graphical device, leading to a more informative discriminant analysis than the usual approach based on a single optimal scale of smoothing for each class density estimate. When there are more than two competing classes, the method splits the problem into a collection of two-class problems; this allows the flexibility of using different bandwidths for different pairs of competing classes and, at the same time, reduces the computational burden of the usual cross-validation-based bandwidth selection in the presence of several competing populations. We present some benchmark examples to illustrate the usefulness of the proposed methodology.
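To make these ideas concrete, the following is a minimal sketch, not the authors' implementation, of kernel discriminant analysis with pair-specific bandwidths and a multiscale error curve. The Gaussian kernel, the function names (kde, pairwise_classify, error_curve), and the synthetic two-class data are all illustrative assumptions.

    import numpy as np

    def kde(x, sample, h):
        # Gaussian kernel density estimate at a single point x (shape (d,))
        # from sample (shape (n, d)) with a scalar bandwidth h.
        n, d = sample.shape
        u = (x - sample) / h
        k = np.exp(-0.5 * np.sum(u * u, axis=1))
        return k.sum() / (n * (2.0 * np.pi) ** (d / 2.0) * h ** d)

    def pairwise_classify(x, samples, priors, bandwidths):
        # Split a J-class problem into J(J-1)/2 two-class comparisons.
        # bandwidths[(i, j)] is the bandwidth for class i's density when it
        # competes against class j, so each pair of classes gets its own
        # scales. The final label is a majority vote over pairwise winners.
        J = len(samples)
        votes = [0] * J
        for i in range(J):
            for j in range(i + 1, J):
                score_i = priors[i] * kde(x, samples[i], bandwidths[(i, j)])
                score_j = priors[j] * kde(x, samples[j], bandwidths[(j, i)])
                votes[i if score_i >= score_j else j] += 1
        return max(range(J), key=votes.__getitem__)

    def error_curve(h_grid, train, test, priors):
        # Multiscale view for two classes: misclassification rate on a test
        # set as a function of a common bandwidth h, instead of committing
        # to a single "optimal" h.
        errors = []
        for h in h_grid:
            bw = {(0, 1): h, (1, 0): h}
            wrong = sum(pairwise_classify(x, train, priors, bw) != y
                        for x, y in test)
            errors.append(wrong / len(test))
        return errors

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Two synthetic bivariate Gaussian classes with shifted means.
        train = [rng.normal(0.0, 1.0, (100, 2)),
                 rng.normal(1.5, 1.0, (100, 2))]
        test = ([(rng.normal(0.0, 1.0, 2), 0) for _ in range(100)]
                + [(rng.normal(1.5, 1.0, 2), 1) for _ in range(100)])
        h_grid = np.exp(np.linspace(np.log(0.05), np.log(5.0), 20))
        for h, e in zip(h_grid, error_curve(h_grid, train, test, [0.5, 0.5])):
            print(f"h = {h:6.3f}  error = {e:.3f}")

Plotting the resulting error curve against log h gives a crude analogue of the multiscale graphical device described above; the paper's actual device and its data-driven, observation-dependent bandwidth choices are more refined than this toy version.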
