Learning Naive Bayes Classifier from Noisy Data

Classification is one of the major tasks in knowledge discovery and data mining. The naive Bayes classifier, in spite of its simplicity, has proven surprisingly effective in many practical applications. In real datasets, noise is inevitable because of measurement imprecision or privacy-preserving mechanisms. In this paper, we develop a new approach, the LinEar-Equation-based noise-aWare bAYes classifier (LEEWAY), for learning the underlying naive Bayes classifier from noisy observations. Using a linear system of equations and optimization methods, LEEWAY reconstructs the underlying probability distributions of the noise-free dataset from the given noisy observations. By incorporating the noise model into the learning process, we improve classification accuracy. Furthermore, as an estimate of the underlying naive Bayes classifier for the noise-free dataset, the reconstructed model can easily be combined with new observations corrupted at different noise levels to obtain good predictive accuracy. Several experiments are presented to evaluate the performance of LEEWAY. The experimental results show that LEEWAY is an effective technique for handling noisy data and that it provides higher classification accuracy than traditional approaches.

Keywords: naive Bayes classifier, noisy data, classification, Bayesian network.
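The core idea of reconstructing noise-free distributions via a linear system can be illustrated with a minimal sketch. Assume (hypothetically; the paper's exact noise model and notation are not given here) that a binary attribute is flipped with a known probability p, so the observed distribution is a linear mixture of the true one; inverting that linear system recovers the underlying probabilities:

```python
import numpy as np

# Assumed noise model: each binary value is flipped with known probability p.
# Then observed = M @ true, where M mixes the two outcomes.
p = 0.2
M = np.array([[1 - p, p],
              [p,     1 - p]])

true_dist = np.array([0.7, 0.3])   # ground truth (unknown in practice)
observed = M @ true_dist           # what the noisy data would show

# Reconstruct the noise-free distribution by solving the linear system.
reconstructed = np.linalg.solve(M, observed)
print(reconstructed)               # recovers [0.7, 0.3]
```

In practice the observed frequencies are estimated from finite, noisy data, so the solved probabilities can fall outside [0, 1]; this is where the optimization step (e.g., a constrained least-squares fit) would replace the exact solve. The same reconstruction is applied to each class-conditional attribute distribution of the naive Bayes model.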
