An analysis of how training data complexity affects the nearest neighbor classifiers

The k-nearest neighbors (k-NN) classifier is one of the most popular supervised classification methods. It is simple, intuitive, and accurate across a great variety of real-world domains. Despite its simplicity and effectiveness, however, its practical use has historically been limited by its high storage requirements and computational cost. Moreover, the performance of this classifier appears to be strongly sensitive to training data complexity. In this context, by means of several problem difficulty measures, we try to characterize the behavior of the k-NN rule when working under certain situations. More specifically, the present analysis focuses on using data complexity measures to describe class overlap, feature space dimensionality, and class density, and on discovering their relation to the practical accuracy of this classifier.
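The connection between a complexity measure and k-NN accuracy can be sketched concretely. The snippet below is a minimal, hypothetical illustration (not the paper's experimental setup): `fisher_ratio` computes Fisher's discriminant ratio (the F1 overlap measure of Ho and Basu) for a single feature, `knn_predict` is a plain majority-vote k-NN, and two synthetic one-dimensional Gaussian problems, one well separated and one heavily overlapping, are compared. All function names and data are invented for the example.

```python
import random

def fisher_ratio(class_a, class_b):
    """Fisher's discriminant ratio (F1) for one feature:
    (mu_a - mu_b)^2 / (var_a + var_b). Higher values mean less class overlap."""
    mu_a = sum(class_a) / len(class_a)
    mu_b = sum(class_b) / len(class_b)
    var_a = sum((x - mu_a) ** 2 for x in class_a) / len(class_a)
    var_b = sum((x - mu_b) ** 2 for x in class_b) / len(class_b)
    return (mu_a - mu_b) ** 2 / (var_a + var_b)

def knn_predict(train, query, k=3):
    """Plain k-NN majority vote; train is a list of (feature, label) pairs."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    labels = [lab for _, lab in neighbors]
    return max(set(labels), key=labels.count)

random.seed(0)
# Two 1-D Gaussian classes: widening the gap between the class means
# raises F1 and, in line with the analysis above, k-NN accuracy.
separated = ([random.gauss(0.0, 1.0) for _ in range(100)],
             [random.gauss(4.0, 1.0) for _ in range(100)])
overlapping = ([random.gauss(0.0, 1.0) for _ in range(100)],
               [random.gauss(0.5, 1.0) for _ in range(100)])

for name, (a, b) in [("separated", separated), ("overlapping", overlapping)]:
    train = [(x, 0) for x in a[:80]] + [(x, 1) for x in b[:80]]
    test = [(x, 0) for x in a[80:]] + [(x, 1) for x in b[80:]]
    acc = sum(knn_predict(train, x) == y for x, y in test) / len(test)
    print(f"{name}: F1={fisher_ratio(a, b):.2f}, k-NN accuracy={acc:.2f}")
```

On such data the separated problem yields a much larger F1 and a higher test accuracy than the overlapping one, which is the kind of relation between overlap measures and k-NN performance that the analysis investigates.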
