The partitioning- and rule-based filter for noise detection

Poor data quality is a prevalent problem across multiple domains, and the amount of noise in a dataset is often indicative of its quality. Data noise is generally categorized into two groups: mislabeling errors (class noise) and attribute noise. Noise detection techniques proposed in the literature include the ensemble filter, the partitioning filter, and data polishing. However, several of these techniques lack adequate noise detection accuracy. In addition, they simply flag instances as noisy without indicating how noisy each flagged instance is relative to the others. This paper proposes a novel approach to noise detection, the partitioning- and rule-based filter. The approach combines four mechanisms to achieve high accuracy in noise detection and to provide a relative, noise-based ranking of instances: repeated data partitioning, inclusive evaluation, unweighted voting, and dual two-class classifiers. The proposed approach is evaluated using datasets obtained from the UCI data repository. Empirical studies with simulated (artificial) noise injected into clean or benchmark datasets demonstrate excellent noise detection performance; in many cases, perfect or near-perfect performance is observed. In addition, the proposed approach achieved significantly better detection rates for class noise than a proven existing approach, the partitioning filter.
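
To make the idea of a vote-based noise ranking concrete, the following is a minimal sketch, not the authors' implementation. It illustrates only two of the four mechanisms, repeated data partitioning and unweighted voting, and it scores every instance with every partition model, which is one possible reading of "inclusive evaluation" and is an assumption here. A nearest-centroid classifier stands in for the rule-based learner, the dual two-class classifiers are not modelled, and all function names and parameters (`noise_votes`, `rank_by_noise`, `n_partitions`, `n_rounds`) are hypothetical.

```python
import random
from collections import defaultdict

def train_centroids(rows):
    """Fit a nearest-centroid model: one mean feature vector per class label.
    (Stand-in learner; the paper's filter is rule-based.)"""
    sums, counts = {}, defaultdict(int)
    for features, label in rows:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def noise_votes(data, n_partitions=3, n_rounds=5, seed=0):
    """Repeatedly partition the data, train one model per partition, and let
    every model cast one unweighted vote on every instance it misclassifies.
    Returns the fraction of models that voted each instance as noisy."""
    rng = random.Random(seed)
    votes = [0] * len(data)
    indices = list(range(len(data)))
    for _ in range(n_rounds):
        rng.shuffle(indices)  # a fresh random partitioning in each round
        for p in range(n_partitions):
            part = [data[i] for i in indices[p::n_partitions]]
            model = train_centroids(part)
            for i, (features, label) in enumerate(data):
                if predict(model, features) != label:
                    votes[i] += 1  # every model's vote counts equally
    total_models = n_rounds * n_partitions
    return [v / total_models for v in votes]

def rank_by_noise(data, **kwargs):
    """Return instance indices sorted from most to least suspicious, plus scores."""
    scores = noise_votes(data, **kwargs)
    ranking = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
    return ranking, scores
```

Each element of `data` is a `(features, label)` pair. Because the score is a vote fraction rather than a yes/no flag, instances can be ordered by how consistently the partition models disagree with their labels, which is the kind of relative ranking the abstract describes.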