Error Detection and Impact-Sensitive Instance Ranking in Noisy Datasets

Given a noisy dataset, locating erroneous instances and attributes and ranking suspicious instances by their impact on system performance is an important research problem. In this paper we present an Error Detection and Impact-sensitive instance Ranking (EDIR) mechanism to address it. Given a noisy dataset D, we first train a benchmark classifier T from D. Instances that T cannot classify effectively are treated as suspicious and forwarded to a subset S. For each attribute Ai, we swap the roles of Ai and the class label C to train a classifier APi that predicts Ai. Given an instance Ik in S, we use APi and the benchmark classifier T to locate erroneous values among its attributes. To rank the instances in S quantitatively, we define an impact measure based on the Information-gain Ratio (IR): we calculate IRi between attribute Ai and C and use it as the impact-sensitive weight of Ai. The total impact value of Ik is the sum of the impact-sensitive weights of all its attributes located as erroneous. Experimental results demonstrate the effectiveness of our strategies.
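
The ranking step can be illustrated with a minimal sketch. It assumes discrete attributes, assumes the usual C4.5-style definition of gain ratio (information gain divided by split information), which the abstract does not spell out, and assumes the error-location step has already produced, for each suspicious instance, the indices of its attributes flagged as erroneous. The function names `entropy`, `gain_ratio`, and `rank_by_impact` and the toy data are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(attr_values, labels):
    """Gain ratio between one discrete attribute and the class label.

    Assumed definition: information gain divided by split information.
    """
    n = len(labels)
    groups = {}
    for a, y in zip(attr_values, labels):
        groups.setdefault(a, []).append(y)
    # Class entropy remaining after partitioning on the attribute.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(attr_values)
    return gain / split_info if split_info > 0 else 0.0


def rank_by_impact(flagged, weights):
    """Rank suspicious instances by total impact, highest first.

    flagged: dict mapping instance id -> indices of attributes located as
             erroneous (assumed to come from the error-location step).
    weights: list where weights[i] is the gain ratio IRi of attribute Ai.
    """
    impact = {k: sum(weights[i] for i in attrs) for k, attrs in flagged.items()}
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)


# Toy dataset: attribute 0 determines the class, attribute 1 is uninformative.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
weights = [gain_ratio([r[i] for r in rows], labels) for i in range(2)]

# Hypothetical output of the error-location step for two suspicious instances.
flagged = {"I1": [1], "I2": [0, 1]}
ranking = rank_by_impact(flagged, weights)
```

In this toy setting an instance with an error in the informative attribute ranks above one whose only flagged attribute carries no information about the class, which is the intent of impact-sensitive weighting.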
