The instance-based k-nearest neighbor algorithm (KNN) [1] is an effective classification model. Its classification is simply based on a vote within the neighborhood, consisting of the k nearest neighbors of the test instance. Recently, researchers have been interested in deploying a more sophisticated local model, such as naive Bayes, within the neighborhood. The expectation is that there are no strong dependences within the neighborhood of the test instance, thus alleviating the conditional independence assumption of naive Bayes. Generally, the smaller the size of the neighborhood (the value of k), the smaller the chance of encountering strong dependences. When k is small, however, the training data for the local naive Bayes is small and its classification would be inaccurate. In existing models, such as LWNB [3], a relatively large k is chosen; the consequence is that strong dependences seem unavoidable.
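As a point of reference, the following is a minimal sketch of the plain KNN vote described above. The Euclidean distance, the tiny dataset, and the value of k are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two numeric feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, test_x, k=3):
    # train is a list of (feature_vector, label) pairs; classify test_x
    # by a majority vote among its k nearest neighbors.
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], test_x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny illustrative usage (hypothetical data)
train = [([1.0, 1.0], "a"), ([1.2, 0.9], "a"), ([5.0, 5.1], "b"), ([4.8, 5.3], "b")]
print(knn_classify(train, [1.1, 1.0], k=3))  # two of the 3 nearest neighbors are "a"
```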
In our opinion, a small k should be preferred in order to avoid strong dependences. We propose to address the resulting lack of local training data using sampling (cloning). Given a test instance, clones of each instance in the neighborhood are generated in terms of its similarity to the test instance and added to the local training data; the local naive Bayes is then trained on this expanded training data. Since a relatively small k is chosen, the chance of encountering strong dependences within the neighborhood is small, and thus the classification of the resulting local naive Bayes would be more accurate. We experimentally compare our new algorithm with KNN and its improved variants in terms of classification accuracy, using the 36 UCI datasets recommended by Weka [8], and the experimental results show that our algorithm outperforms all those algorithms significantly and consistently at various values of k.
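The sketch below illustrates the cloning idea for nominal attributes: the neighborhood is expanded with extra copies of each neighbor and a naive Bayes classifier is trained on the expanded data. The overlap similarity measure, the rule that uses the similarity count directly as the number of copies, and the Laplace smoothing are illustrative assumptions; the paper's exact similarity measure and cloning scheme may differ.

```python
import math
from collections import Counter, defaultdict

def similarity(x, y):
    # Overlap similarity: the number of attribute values two instances share.
    return sum(1 for a, b in zip(x, y) if a == b)

def clone_neighborhood(train, test_x, k):
    # Keep the k neighbors most similar to test_x, then add copies of each
    # neighbor according to its similarity to test_x (assumed cloning rule).
    neighbors = sorted(train, key=lambda xy: -similarity(xy[0], test_x))[:k]
    expanded = []
    for x, y in neighbors:
        expanded.extend([(x, y)] * max(1, similarity(x, test_x)))
    return expanded

def local_naive_bayes_classify(train, test_x, k=5):
    # Train a naive Bayes model on the cloned neighborhood and classify test_x.
    data = clone_neighborhood(train, test_x, k)
    n_values = [len({x[i] for x, _ in train}) for i in range(len(test_x))]  # distinct values per attribute
    class_counts = Counter(y for _, y in data)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value frequencies
    for x, y in data:
        for i, v in enumerate(x):
            value_counts[(i, y)][v] += 1
    best_label, best_score = None, float("-inf")
    for y, ny in class_counts.items():
        score = math.log((ny + 1) / (len(data) + len(class_counts)))  # Laplace-smoothed prior
        for i, v in enumerate(test_x):
            score += math.log((value_counts[(i, y)][v] + 1) / (ny + n_values[i]))  # smoothed likelihood
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Tiny illustrative usage with nominal attributes (hypothetical data)
train = [(["sunny", "hot"], "no"), (["sunny", "mild"], "no"),
         (["rain", "mild"], "yes"), (["rain", "cool"], "yes"),
         (["overcast", "hot"], "yes")]
print(local_naive_bayes_classify(train, ["sunny", "cool"], k=3))
```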
[1] Catherine Blake, et al. UCI Repository of machine learning databases, 1998.
[2] Yoshua Bengio, et al. Inference for the Generalization Error, 1999, Machine Learning.
[3] D. Kibler, et al. Instance-based learning algorithms, 2004, Machine Learning.
[4] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques, 3rd Edition, 1999.
[5] Ron Kohavi, et al. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid, 1996, KDD.
[6] Bernhard Pfahringer, et al. Locally Weighted Naive Bayes, 2002, UAI.
[7] Geoffrey I. Webb, et al. Lazy Learning of Bayesian Rules, 2000, Machine Learning.
[8] Pat Langley, et al. An Analysis of Bayesian Classifiers, 1992, AAAI.
[9] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques with Java implementations, 2002, SGMD.
[10] Pedro M. Domingos, et al. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, 1997, Machine Learning.
[11] Foster J. Provost, et al. Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction, 2003, J. Artif. Intell. Res.
[12] Pedro M. Domingos, et al. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996, ICML.
[13] Mong-Li Lee, et al. SNNB: A Selective Neighborhood Based Naïve Bayes for Lazy Learning, 2002, PAKDD.