Generating multiple noise elimination filters with the ensemble-partitioning filter

We present the ensemble-partitioning filter, a generalization of several common filtering techniques from the literature. Filtering the training dataset, i.e., removing noisy instances, can improve the accuracy of the induced data mining learners. Tuning the filter's few parameters adapts it to a given data mining problem; for example, it can be specialized into the classification, ensemble, multiple-partitioning, or iterative-partitioning filter. The predictions of the filtering experts are combined such that an instance misclassified by a certain number of experts (learners) is labeled as noisy. The conservativeness of the ensemble-partitioning filter depends on the filtering level and the number of filtering iterations. A case study of software metrics data from a high-assurance software project analyzes the similarities between the filters obtained by specializing the ensemble-partitioning filter. We show that the filters at different levels of conservativeness agree on labeling instances as noisy over 25% of the time, and that the classification filter has the lowest agreement with the other filters.
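The voting scheme described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the `experts` here are hypothetical stand-in classifiers (in the paper they would be learners trained on partitions of the data), and the toy thresholds and data are invented for demonstration. The `filter_level` parameter controls conservativeness: requiring all experts to err (consensus) is the most conservative setting, while a majority vote removes more instances.

```python
# Sketch of ensemble-based noise filtering via expert voting.
# Assumption: each "expert" is any callable mapping a feature value to a
# predicted label; the paper's filter would use induced learners instead.

def ensemble_filter(instances, labels, experts, filter_level):
    """Flag instance i as noisy if at least `filter_level` experts
    misclassify it. filter_level == len(experts) gives the conservative
    consensus filter; a smaller level (e.g., a majority) is more aggressive."""
    noisy = []
    for i, (x, y) in enumerate(zip(instances, labels)):
        errors = sum(1 for predict in experts if predict(x) != y)
        if errors >= filter_level:
            noisy.append(i)
    return noisy

# Toy 1-D data where label should be 1 for feature > 0.5;
# the last instance is deliberately mislabeled.
X = [0.1, 0.2, 0.9, 0.8, 0.95]
y = [0, 0, 1, 1, 0]

# Three hypothetical experts with slightly different decision thresholds.
experts = [lambda v, t=t: int(v > t) for t in (0.4, 0.5, 0.6)]

print(ensemble_filter(X, y, experts, filter_level=3))  # → [4] (consensus vote)
```

In an iterative-partitioning variant, this vote would be repeated over several rounds, removing the flagged instances after each round until few or none remain.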
