Poor data quality is prevalent across multiple domains. The amount of noise in a dataset is often a reflection of the quality of its data. Data noise is generally categorized into two groups: class noise (mislabeling errors) and attribute noise. Noise detection techniques proposed in the literature include the ensemble filter, the partitioning filter, and data polishing. However, several of these techniques lack adequate noise detection accuracy. In addition, they simply flag instances as noisy without indicating how noisy each instance is relative to the others. A novel noise detection approach, the partitioning- and rule-based filter, is proposed. The approach combines four mechanisms to achieve high noise detection accuracy and to provide a relative noise-based ranking of instances: repeated data partitioning, inclusive evaluation, unweighted voting, and dual two-class classifiers. The proposed approach is evaluated using datasets obtained from the UCI data repository. Empirical studies with simulated (artificial) noise injected into clean or benchmark datasets demonstrate excellent noise detection performance; in many cases, perfect or near-perfect detection is observed. In addition, the proposed approach achieved significantly better detection rates for class noise than a proven existing approach, the partitioning filter.
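To make two of the named mechanisms concrete, the following is a minimal sketch of repeated data partitioning combined with unweighted voting. It is not the authors' implementation. It makes several assumptions that do not come from the abstract: a nearest-centroid classifier stands in for the rule-based learner, the partitions are stratified by label, and the names `noise_scores` and `rank_by_noise` are hypothetical. Inclusive evaluation and the dual two-class classifiers are not modeled, because the abstract does not define them.

```python
import random
from collections import defaultdict


def fit_centroids(rows):
    """Stand-in learner (an assumption): one mean vector per class label."""
    sums, counts = {}, defaultdict(int)
    for features, label in rows:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(features, centroids[lab])),
    )


def noise_scores(data, n_partitions=3, n_rounds=10, seed=0):
    """Fraction of partition models that disagree with each instance's label.

    data is a list of (features, label) pairs. Each round re-partitions the
    data (stratified by label, an assumption of this sketch), trains one model
    per partition, and lets every model cast one equal-weight vote on every
    instance.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, (_, label) in enumerate(data):
        by_label[label].append(index)

    votes = [0] * len(data)
    models_built = 0
    for _ in range(n_rounds):
        partitions = [[] for _ in range(n_partitions)]
        for indices in by_label.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            for p in range(n_partitions):
                partitions[p].extend(shuffled[p::n_partitions])
        for part in partitions:
            if not part:
                continue  # skip empty partitions on very small datasets
            model = fit_centroids([data[i] for i in part])
            models_built += 1
            for i, (features, label) in enumerate(data):
                if predict(model, features) != label:
                    votes[i] += 1  # unweighted: every model counts once
    return [v / models_built for v in votes]


def rank_by_noise(scores):
    """Instance indices ordered from most to least suspicious."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

Because each instance receives a score in [0, 1] rather than a binary noisy/clean flag, the instances can be ordered by how often the partition models reject their labels. This is the kind of relative ranking the abstract describes, although the paper's actual scoring rule may differ.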