Polishing Blemishes: Issues in Data Correction

Data quality is crucial to any data analysis task. Many imperfection-handling techniques avoid overfitting or simply remove offending portions of the data. Polishing identifies blemishes in the data and makes corrections to retain and recover as much information as possible. When using information collected from channels susceptible to disturbances, data quality is a concern, especially when the primary objective is to assimilate and understand the data. Imperfections can arise from many sources, including transmission and bandwidth constraints, faults in sensor devices, irregularities in sampling, and transcription errors. An intuitive application that exemplifies handling data imperfections is the spell-checker. Developing such a spell-checker would require novel techniques for repairing data imperfections. We are exploring such techniques using a data correction method called polishing. Here, we compare polishing to two alternative approaches to handling data imperfections, focusing on how to evaluate and validate data correction mechanisms.
