Mining With Noise Knowledge: Error-Aware Data Mining

Real-world data mining deals with noisy information sources, where data collection inaccuracy, device limitations, transmission and discretization errors, or man-made perturbations frequently produce imprecise or vague data. Two common practices are to apply data cleansing to improve data consistency, or to simply treat noisy data as quality sources and feed them directly into mining algorithms. Either approach may substantially sacrifice mining performance. In this paper, we consider an error-aware (EA) data mining design that takes advantage of statistical error information (such as noise level and noise distribution) to improve mining results. We assume that such noise knowledge is available in advance, and we propose a solution to incorporate it into the mining process. More specifically, we use noise knowledge to restore the original data distributions, which are then used to rectify the model built from noise-corrupted data. We materialize this concept in a proposed EA naive Bayes classification algorithm. Experimental comparisons on real-world datasets demonstrate the effectiveness of this design.
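The core idea of restoring original distributions can be illustrated with a minimal sketch (not the paper's implementation): assume a binary attribute is corrupted by symmetric noise with a known flip rate `p`, so an observed frequency relates to the true one by `observed = true*(1-p) + (1-true)*p`. Inverting this relation recovers the true conditional probabilities that a naive Bayes model needs. All names and parameter values below are illustrative assumptions.

```python
import numpy as np

def denoise_prob(observed, flip_rate):
    """Invert symmetric binary noise: observed = true*(1-p) + (1-true)*p,
    so true = (observed - p) / (1 - 2p). Clip to keep a valid probability."""
    corrected = (observed - flip_rate) / (1.0 - 2.0 * flip_rate)
    return float(np.clip(corrected, 1e-6, 1.0 - 1e-6))

# Synthetic example with known ground truth
rng = np.random.default_rng(0)
n = 20000
y = rng.random(n) < 0.6                                     # true class labels
x = np.where(y, rng.random(n) < 0.8, rng.random(n) < 0.3)   # P(x=1|y=1)=0.8, P(x=1|y=0)=0.3

p = 0.2                                                     # known noise level
x_noisy = np.where(rng.random(n) < p, ~x, x)                # flip each attribute value w.p. p

# Conditional frequencies observed in the noisy data are biased toward 0.5
obs_p1 = x_noisy[y].mean()    # approx 0.8*(1-p) + 0.2*p = 0.68
obs_p0 = x_noisy[~y].mean()   # approx 0.3*(1-p) + 0.7*p = 0.38

# Error-aware correction restores the original conditionals (approx 0.8 and 0.3),
# which can then replace the biased estimates inside a naive Bayes model
est_p1 = denoise_prob(obs_p1, p)
est_p0 = denoise_prob(obs_p0, p)
```

Plugging the corrected conditionals into the usual naive Bayes product rectifies the classifier without ever cleaning individual records, which is the essence of the error-aware design described above.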

[1]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[2]  D. Holt,et al.  A Systematic Approach to Automatic Edit and Imputation , 1976 .

[3]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[4]  Roger A. Sugden,et al.  Multiple Imputation for Nonresponse in Surveys , 1988 .

[5]  Pat Langley,et al.  An Analysis of Bayesian Classifiers , 1992, AAAI.

[6]  U. Fayyad,et al.  On the handling of continuous-valued attributes in decision tree generation , 2004, Machine Learning.

[7]  Isabelle Guyon,et al.  Discovering Informative Patterns and Data Cleaning , 1996, Advances in Knowledge Discovery and Data Mining.

[8]  Veda C. Storey,et al.  A Framework for Analysis of Data Quality Research , 1995, IEEE Trans. Knowl. Data Eng..

[9]  George H. John Robust Decision Trees: Removing Outliers from Databases , 1995, KDD.

[10]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.

[11]  Ron Kohavi,et al.  Supervised and Unsupervised Discretization of Continuous Features , 1995, ICML.

[12]  Carla E. Brodley,et al.  Improving automated land cover mapping by identifying and eliminating mislabeled observations from training data , 1996, IGARSS '96. 1996 International Geoscience and Remote Sensing Symposium.

[13]  U. M. Feyyad Data mining and knowledge discovery: making sense out of data , 1996 .

[14]  D. S. Sivia,et al.  Data Analysis , 1996, Encyclopedia of Evolutionary Psychological Science.

[15]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[16]  Johannes Fürnkranz,et al.  Integrative Windowing , 1998, J. Artif. Intell. Res..

[17]  Warren Sarle Prediction with Missing Inputs , 1998 .

[18]  Nada Lavrac,et al.  Experiments with Noise Filtering in a Medical Domain , 1999, ICML.

[19]  Choh-Man Teng,et al.  Correcting Noisy Data , 1999, ICML.

[20]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[21]  Francesc J. Ferri,et al.  Considerations about sample-size sensitivity of a family of edited nearest-neighbor rules , 1999, IEEE Trans. Syst. Man Cybern. Part B.

[22]  Carla E. Brodley,et al.  Identifying Mislabeled Training Data , 1999, J. Artif. Intell. Res..

[23]  Thomas Reinartz,et al.  CRISP-DM 1.0: Step-by-step data mining guide , 2000 .

[24]  Ramakrishnan Srikant,et al.  Privacy-preserving data mining , 2000, SIGMOD '00.

[25]  Eric R. Ziegel,et al.  Mastering Data Mining , 2001, Technometrics.

[26]  Michael Griebel,et al.  Data mining with sparse grids using simplicial basis functions , 2001, KDD '01.

[27]  Jean-Frangois Beaumont ON REGRESSION IMPUTATION IN THE PRESENCE OF NONIGNORABLE NONRESPONSE , 2002 .

[28]  Mingxiu Hu,et al.  EVALUATION OF SOME POPULAR IMPUTATION ALGORITHMS , 2002 .

[29]  Xindong Wu,et al.  Eliminating Class Noise in Large Datasets , 2003, ICML.

[30]  Matthias Jarke,et al.  Systematic Development of Data Mining-Based Data Quality Tools , 2003, VLDB.

[31]  Wenliang Du,et al.  Using randomized response techniques for privacy-preserving data mining , 2003, KDD '03.

[32]  Andrew W. Moore,et al.  Probabilistic noise identification and data cleaning , 2003, Third IEEE International Conference on Data Mining.

[33]  Xindong Wu,et al.  Dealing with Predictive-but-Unpredictable Attributes in Noisy Data Sources , 2004, PKDD.

[34]  Xindong Wu,et al.  Error Detection and Impact-Sensitive Instance Ranking in Noisy Datasets , 2004, AAAI.

[35]  Pedro M. Domingos,et al.  On the Optimality of the Simple Bayesian Classifier under Zero-One Loss , 1997, Machine Learning.

[36]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[37]  Tariq Samad,et al.  Imputation of Missing Data in Industrial Databases , 1999, Applied Intelligence.

[38]  Xingquan Zhu,et al.  Class Noise vs. Attribute Noise: A Quantitative Study , 2003, Artificial Intelligence Review.

[39]  M. Scanu,et al.  Bayesian networks for imputation , 2004 .

[40]  Xindong Wu,et al.  Cost-constrained data acquisition for intelligent data preparation , 2005, IEEE Transactions on Knowledge and Data Engineering.

[41]  Wenliang Du,et al.  Deriving private information from randomized data , 2005, SIGMOD '05.

[42]  David W. Aha,et al.  Instance-Based Learning Algorithms , 1991, Machine Learning.

[43]  Stuart Aitken,et al.  Mining housekeeping genes with a Naive Bayes classifier , 2006, BMC Genomics.

[44]  Xindong Wu,et al.  Error awareness data mining , 2006, 2006 IEEE International Conference on Granular Computing.

[45]  Xindong Wu Class Noise vs Attribute Noise: Their Impacts, Detection and Cleansing , 2007, PAKDD.

[46]  Xindong Wu,et al.  Noise Modeling with Associative Corruption Rules , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).