Paper information - Nearest-neighbor classification with categorical variables

Nearest-neighbor classification with categorical variables

Abstract A technique is presented for adapting nearest-neighbor classification to the case of categorical variables. The set of categories is mapped onto the real line in such a way as to maximize the ratio of total sum of squares to within-class sum of squares, aggregated over classes. The resulting real values then replace the categories, and nearest-neighbor classification proceeds with the Euclidean metric on these new values. Continuous variables can be included in this scheme with little added effort. This approach has been implemented in a computer program and tried on a number of data sets, with encouraging results.

Nearest-neighbor classification is a well-known and effective classification technique. With this scheme, an unknown item's distances to all known items are measured, and the unknown item's class is estimated by the class of the nearest neighbor or by the class most often represented among a set of nearest neighbors. This has proven effective in many examples, but an appropriate distance normalization is required when variables are scaled differently. For categorical variables, "distance" is not even defined. In this paper, categorical data values are replaced by real numbers in an optimal way; those real numbers are then used in nearest-neighbor classification.
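The mapping described in the abstract can be illustrated with a small sketch. This is not the author's program. It assumes a single categorical variable coded as integers 0..C-1, treats the ratio of total to within-class sum of squares as a generalized eigenproblem, and adds a small ridge term (an assumption made here for numerical stability) because the within-class matrix is singular. The function names `category_scores` and `knn_predict` and the toy data are hypothetical.

```python
import numpy as np

def category_scores(codes, labels, ridge=1e-6):
    """Score the categories of one variable by maximizing total SS / within-class SS.

    Illustrative sketch only: one variable at a time, ridge-regularized.
    """
    codes = np.asarray(codes)
    labels = np.asarray(labels)
    n_cat = int(codes.max()) + 1
    Z = np.eye(n_cat)[codes]                      # one-hot indicators, shape (n, n_cat)
    counts = Z.sum(axis=0)
    # s @ T @ s is the total sum of squares of the scored values Z @ s.
    T = np.diag(counts) - np.outer(counts, counts) / len(codes)
    # s @ W @ s is the within-class sum of squares, accumulated over classes.
    W = np.zeros((n_cat, n_cat))
    for k in np.unique(labels):
        ck = Z[labels == k].sum(axis=0)
        W += np.diag(ck) - np.outer(ck, ck) / ck.sum()
    # Constant scores have zero within-class SS, so W is singular; the ridge
    # (an assumption of this sketch) makes it positive definite.
    W += ridge * np.eye(n_cat)
    # Whiten with the Cholesky factor of W, then take the top eigenvector.
    L = np.linalg.cholesky(W)
    L_inv = np.linalg.inv(L)
    M = L_inv @ T @ L_inv.T
    _, vecs = np.linalg.eigh(M)                   # eigenvalues in ascending order
    return L_inv.T @ vecs[:, -1]                  # scores, defined up to sign and scale

def knn_predict(train_X, train_y, query_X, k=1):
    """Plain k-nearest-neighbor vote with the Euclidean metric."""
    d = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(train_y[row]).argmax() for row in nearest])

# Toy data: categories 0 and 1 occur mostly in class 0, categories 2 and 3 mostly in class 1.
codes = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 1])
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
scores = category_scores(codes, labels)

train_X = scores[codes][:, None]                  # replace each category by its score
query_X = scores[np.array([0, 3])][:, None]       # two unknown items, categories 0 and 3
pred = knn_predict(train_X, labels, query_X, k=3)
```

With several categorical variables, one plausible extension is to score each variable separately and stack the resulting columns before computing distances; continuous variables could be appended as further columns after suitable scaling. How the original program handles these cases is not specified in the abstract.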

Samuel E. Buttrey
