Error awareness data mining

Real-world data mining applications often deal with low-quality information sources, where inaccurate data collection, device limitations, transmission and discretization errors, or man-made perturbations frequently result in imprecise or vague data. Two common practices are either to apply data cleansing to enhance data consistency, or simply to treat noisy data as a quality source and feed it into the data mining algorithms. Either approach may substantially degrade mining performance. In this paper, we consider an error awareness data mining framework, which takes advantage of statistical error information (such as noise level and noise distribution) to improve data mining results. We assume such noise knowledge is available in advance, and propose a solution that incorporates it into the mining process. More specifically, we use noise knowledge to restore the original data distributions, and then use the restored information to modify the model built from noise-corrupted data. We present an Error Awareness Naive Bayes (EA_NB) classification algorithm, and provide extensive experimental comparisons to demonstrate the effectiveness of this approach.
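As a rough illustration of what "restoring original data distributions" from known noise statistics can look like, the sketch below inverts a simple noise model for a single categorical attribute. It is not the EA_NB algorithm itself, whose details are not given here. The noise model is an assumption made for this example: each value is replaced, with probability `noise_level`, by one of the other k-1 values chosen uniformly. The function names `corrupt_distribution` and `restore_distribution` are hypothetical.

```python
def corrupt_distribution(clean, noise_level):
    """Distribution observed after noise (assumed model, not from the paper).

    Each value keeps its identity with probability 1 - noise_level and is
    otherwise replaced by one of the other k-1 values, chosen uniformly.
    """
    k = len(clean)
    spill = noise_level / (k - 1)  # mass each value receives from any other value
    return [(1.0 - noise_level) * p + spill * (1.0 - p) for p in clean]


def restore_distribution(observed, noise_level):
    """Invert the assumed noise model to estimate the clean distribution."""
    k = len(observed)
    spill = noise_level / (k - 1)
    scale = 1.0 - noise_level - spill
    if scale <= 0.0:
        # At or beyond this noise level the observed data no longer
        # identifies the clean distribution under the assumed model.
        raise ValueError("noise level too high to restore the distribution")
    # Sampling error can push estimates slightly below zero, so clip and renormalize.
    restored = [max((q - spill) / scale, 0.0) for q in observed]
    total = sum(restored)
    return [p / total for p in restored]


# Hypothetical class-conditional distribution of a three-valued attribute.
clean = [0.7, 0.2, 0.1]
observed = corrupt_distribution(clean, noise_level=0.2)
estimate = restore_distribution(observed, noise_level=0.2)
```

In a Naive Bayes setting, a correction of this kind would be applied to each class-conditional attribute distribution estimated from the noisy training data before the probabilities are used for classification. With finite samples the restored values are only estimates, which is why the sketch clips and renormalizes.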
