Empirical Case Studies in Attribute Noise Detection

Data quality is a central concern in any domain-specific data mining and knowledge discovery initiative. The validity of solutions produced by data-driven algorithms diminishes when the data being analyzed are of low quality. Data quality is often characterized in terms of the noise present in a dataset, which can include noisy attributes or labeling errors. Hence, tools for improving data quality are important to the data mining analyst. We present a comprehensive empirical investigation of our new technique for ranking the attributes in a given dataset from most to least noisy. Once the noisy attributes are identified, specific treatments can be applied depending on how the data are to be used. In a classification setting, for example, if the class label is determined to contain the most noise, processes to cleanse this important attribute may be undertaken. Independent variables (predictors) that have a low correlation with the class attribute and appear noisy may be eliminated from the analysis. Several case studies using both real-world and synthetic datasets are presented. Noise detection performance is evaluated by injecting noise into multiple attributes at different noise levels. The empirical results demonstrate that our technique provides an accurate and useful ranking of noisy attributes in a given dataset.
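
The evaluation protocol mentioned above, injecting noise into chosen attributes at a chosen level, can be illustrated with a minimal sketch. The abstract does not specify the injection mechanism, so everything here is an assumption made for illustration: the function name `inject_attribute_noise`, the choice to corrupt a fixed fraction of rows, and the choice to replace each corrupted value with a uniform draw from the attribute's observed range.

```python
import random


def inject_attribute_noise(rows, attribute_index, noise_level, rng):
    """Return (noisy_rows, corrupted_ids) for one numeric attribute.

    Assumed scheme (not taken from the paper): a fraction `noise_level`
    of the rows is selected at random, and the chosen attribute in each
    selected row is replaced by a uniform draw from the observed range.
    """
    lo = min(row[attribute_index] for row in rows)
    hi = max(row[attribute_index] for row in rows)
    n_corrupt = round(noise_level * len(rows))
    corrupted_ids = set(rng.sample(range(len(rows)), n_corrupt))

    noisy_rows = []
    for i, row in enumerate(rows):
        new_row = list(row)  # copy so the clean data stay untouched
        if i in corrupted_ids:
            new_row[attribute_index] = rng.uniform(lo, hi)
        noisy_rows.append(new_row)
    return noisy_rows, corrupted_ids


# Hypothetical toy dataset: 100 rows, 3 numeric attributes.
rng = random.Random(0)
clean = [[float(i), float(i % 7), float(i * 2)] for i in range(100)]
noisy, corrupted = inject_attribute_noise(
    clean, attribute_index=1, noise_level=0.2, rng=rng
)
```

Repeating the call for several attributes and several values of `noise_level` yields datasets in which the truly noisy attributes are known, so a ranking produced by any noise detection technique can be compared against that ground truth.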
