Class noise elimination approach for large datasets based on a combination of classifiers

Detecting and eliminating noise points, or class noise, has become increasingly important when handling large datasets. Eliminating noise in this setting helps reduce computing costs, especially when clustering algorithms are used. A wide variety of clustering algorithms now exist and produce good results. However, they often assume that the input data are free of noise or contain only a very low level of it, which is rarely the case in real Big Data contexts. In this paper, we present a noise detection and elimination approach for large datasets. The approach relies on four steps: divide the data into subsets, extract the best rules, apply different classifiers to the subsets, and finally combine the classifiers' results.
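The four steps can be illustrated with a minimal sketch. The abstract does not name the rule extractor, the classifiers, or the combination scheme, so everything below is an assumption made for illustration: a single-feature threshold rule stands in for rule extraction, nearest-centroid and leave-one-out 1-NN stand in for the "different classifiers", and a majority vote stands in for the combination step. All function names are hypothetical.

```python
from collections import Counter
from math import dist

def partition(data, n_subsets):
    # Step 1: round-robin split into roughly equal subsets.
    return [data[i::n_subsets] for i in range(n_subsets)]

def predict_rule(subset):
    # Step 2 (assumed stand-in): keep the one-feature threshold rule
    # "x[f] <= t -> lo, otherwise hi" with the highest accuracy on the subset.
    best = None
    for f in range(len(subset[0][0])):
        for t in sorted({x[f] for x, _ in subset}):
            left = Counter(y for x, y in subset if x[f] <= t)
            right = Counter(y for x, y in subset if x[f] > t)
            lo = left.most_common(1)[0][0]
            hi = right.most_common(1)[0][0] if right else lo
            correct = left[lo] + (right[hi] if right else 0)
            if best is None or correct > best[0]:
                best = (correct, f, t, lo, hi)
    _, f, t, lo, hi = best
    return [lo if x[f] <= t else hi for x, _ in subset]

def predict_centroid(subset):
    # Step 3a (assumed classifier): assign each point to the nearest class centroid.
    sums, counts = {}, Counter()
    for x, y in subset:
        counts[y] += 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
    centroids = {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}
    return [min(centroids, key=lambda y: dist(x, centroids[y])) for x, _ in subset]

def predict_1nn(subset):
    # Step 3b (assumed classifier): leave-one-out 1-nearest-neighbour.
    # Requires at least two instances per subset.
    preds = []
    for i, (x, _) in enumerate(subset):
        others = [(dist(x, x2), y2) for j, (x2, y2) in enumerate(subset) if j != i]
        preds.append(min(others, key=lambda p: p[0])[1])
    return preds

def detect_noise(data, n_subsets=2, min_votes=2):
    # Step 4 (assumed combination): flag an instance as class noise when at
    # least min_votes of the classifiers disagree with its given label.
    classifiers = [predict_rule, predict_centroid, predict_1nn]
    noisy = []
    for subset in partition(data, n_subsets):
        votes = [0] * len(subset)
        for clf in classifiers:
            for i, pred in enumerate(clf(subset)):
                if pred != subset[i][1]:
                    votes[i] += 1
        noisy.extend(inst for inst, v in zip(subset, votes) if v >= min_votes)
    return noisy

def eliminate_noise(data, **kwargs):
    # Return the data with the flagged instances removed.
    flagged = detect_noise(data, **kwargs)
    return [inst for inst in data if inst not in flagged]
```

Here `data` is a list of `(features, label)` pairs with `features` given as a tuple of numbers. Raising `min_votes` to the number of classifiers would turn the majority vote into a consensus filter, which removes fewer clean instances at the cost of missing more noisy ones. On very small subsets the individual classifiers become unreliable, so `n_subsets` should be chosen so that each subset still contains enough examples of every class.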
