Using Hellinger distance in a nearest neighbour classifier for relational databases

Abstract: Nearest neighbour algorithms classify a previously unseen input case by finding similar stored cases and using them to predict the unknown features of the input. Their usefulness has been demonstrated in many real-world domains. Unfortunately, most of the similarity measures discussed in the current nearest neighbour learning literature handle only a few data types, which limits their applicability to relational database applications. In this paper, we propose an enhanced nearest neighbour learning algorithm that is applicable to relational databases. The proposed method allows similarity to be defined over a wide spectrum of attribute types, and it automatically assigns each attribute a weight reflecting its importance with respect to the target attribute. The method has been implemented as a computer program, and its effectiveness has been tested on four publicly available machine learning databases. Its performance is compared with that of another well-known machine learning method, C4.5. In our experiments, the classification accuracy of the proposed system was superior to that of C4.5 in most cases.
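To make the idea concrete, the following is a minimal sketch of a Hellinger-weighted nearest neighbour classifier for nominal attributes. It is not the paper's implementation: the abstract does not give the weighting formula, so the sketch assumes that an attribute's weight is the average Hellinger distance between the class prior and the class distribution conditioned on each attribute value, and that the distance between two cases is the sum of the weights of the attributes on which they differ. The function names `hellinger`, `attribute_weights`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (dicts value -> prob).

    Uses the normalised form sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)),
    which lies in [0, 1].
    """
    keys = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(total / 2.0)


def attribute_weights(rows, labels):
    """Assumed weighting scheme (not taken from the paper).

    For each attribute, average the Hellinger distance between the class
    prior and the class distribution given each attribute value, weighting
    each value by how often it occurs.
    """
    n = len(rows)
    prior = {c: k / n for c, k in Counter(labels).items()}
    weights = []
    for j in range(len(rows[0])):
        by_value = defaultdict(Counter)  # attribute value -> class counts
        for row, c in zip(rows, labels):
            by_value[row[j]][c] += 1
        w = 0.0
        for counts in by_value.values():
            m = sum(counts.values())
            posterior = {c: k / m for c, k in counts.items()}
            w += (m / n) * hellinger(prior, posterior)
        weights.append(w)
    return weights


def classify(query, rows, labels, weights):
    """1-nearest-neighbour prediction under a weighted overlap distance."""

    def dist(row):
        # Add an attribute's weight only where the two cases disagree.
        return sum(w for w, a, b in zip(weights, row, query) if a != b)

    best = min(range(len(rows)), key=lambda i: dist(rows[i]))
    return labels[best]
```

Under this assumed scheme, an attribute whose values leave the class distribution unchanged receives a weight of zero and therefore has no influence on the distance, while an attribute that separates the classes well receives a larger weight. Numeric attributes would need to be discretised first, which this sketch does not cover.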
