Detecting noisy instances with the rule-based classification model

The performance of a classification model is invariably affected by the characteristics of the measurement data it is built upon. If the quality of the data is poor, the classification model will likewise perform poorly. The number of noisy instances in a given dataset is a good indicator of data quality, and detecting and removing such instances improves both the quality of the data and, consequently, the performance of the classification model. This study presents a practical and user-friendly approach for detecting data noise based on Boolean rules generated from the measurement data. It follows a simple, replicable procedure that analyzes the rules to detect mislabeled instances in the training dataset; such instances are treated as data noise and removed to obtain a clean dataset. A case study of a software measurement dataset with known noisy instances is used to demonstrate the effectiveness of the approach. The dataset is obtained from a NASA software project developed for real-time predictions based on simulations. It is empirically demonstrated that the proposed approach is highly effective at detecting noise in this dataset: it detected 100% of the known noisy instances. The proposed approach is compared with noise filtering based on five classification filters and an ensemble filter of the five classifiers. We also demonstrate that the approach shows excellent promise in detecting noisy instances in six independent, real-world software measurement datasets with unknown noisy instances.
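To make the idea concrete, the following is a minimal sketch of rule-based noise detection: an instance covered only by Boolean rules that predict the opposite class label is flagged as mislabeled. The rule-induction step is omitted, and the metric names (loc, cyclomatic), thresholds, and class labels below are illustrative assumptions, not the paper's actual Boolean discriminant rules.

```python
# Sketch of rule-based noise detection under assumed rules and metrics.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    condition: Callable[[Dict[str, float]], bool]  # Boolean test over an instance's metrics
    label: str                                     # class the rule predicts when it fires


def detect_noise(instances: List[Dict[str, float]],
                 labels: List[str],
                 rules: List[Rule]) -> List[int]:
    """Return indices of instances whose label conflicts with every rule covering them."""
    noisy = []
    for i, (inst, label) in enumerate(zip(instances, labels)):
        covering = [r for r in rules if r.condition(inst)]
        # An instance covered only by rules predicting a different class
        # is treated as a mislabeled (noisy) instance.
        if covering and all(r.label != label for r in covering):
            noisy.append(i)
    return noisy


# Hypothetical Boolean rules over two software metrics:
rules = [
    Rule(lambda m: m["loc"] > 300 and m["cyclomatic"] > 10, "fault-prone"),
    Rule(lambda m: m["loc"] <= 300 and m["cyclomatic"] <= 10, "not-fault-prone"),
]

instances = [
    {"loc": 450, "cyclomatic": 15},
    {"loc": 120, "cyclomatic": 4},
]
labels = ["not-fault-prone", "not-fault-prone"]

print(detect_noise(instances, labels, rules))  # -> [0]: the first instance conflicts
```

Instances flagged this way would then be removed from the training set before the classification model is built, which is the cleaning step the study evaluates.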
