Data Cleaning for Classification Using Misclassification Analysis

In many classification problems, data cleaning is used as a preprocessing step to achieve better results. Its purpose is to remove noise, inconsistencies and errors from the training data, so that a cleaner and more representative data set is available for developing a reliable classification model. Unclean data can reduce the classification accuracy of most models. In this paper, we investigate the use of misclassification analysis for data cleaning. To demonstrate the concept, we use an Artificial Neural Network (ANN) as the core computational intelligence technique. We evaluate the proposed cleaning technique on four benchmark data sets from the University of California Irvine (UCI) machine learning repository. All four are binary classification problems: German credit data, BUPA liver disorders, Johns Hopkins Ionosphere and Pima Indians Diabetes. The results show that the proposed cleaning technique could be a good alternative for providing some confidence when constructing a classification model.
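The abstract does not spell out the cleaning procedure, so the following is only a minimal sketch of one plausible reading: train a classifier, flag the training instances it misclassifies when they are held out, and drop them before building the final model. Everything here is an assumption for illustration: the function names (`centroids`, `predict`, `flag_misclassified`, `clean`), the use of k-fold hold-out, the toy data, and above all the nearest-centroid classifier, which stands in for the ANN used in the paper only to keep the example dependency-free.

```python
from collections import defaultdict


def centroids(X, y):
    """Fit a nearest-centroid model: mean feature vector per class.
    (Assumed stand-in for training the paper's ANN.)"""
    sums, counts = {}, defaultdict(int)
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: sq_dist(model[label]))


def flag_misclassified(X, y, k_folds=3):
    """Return indices of instances misclassified when held out of training.
    Fold membership is index % k_folds (an illustrative choice)."""
    flagged = []
    for fold in range(k_folds):
        train = [i for i in range(len(X)) if i % k_folds != fold]
        held_out = [i for i in range(len(X)) if i % k_folds == fold]
        model = centroids([X[i] for i in train], [y[i] for i in train])
        flagged += [i for i in held_out if predict(model, X[i]) != y[i]]
    return sorted(flagged)


def clean(X, y, flagged):
    """Drop the flagged instances and return the remaining data."""
    drop = set(flagged)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: two well-separated clusters plus two
# deliberately mislabeled points (indices 5 and 11).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
     (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5), (5.8, 5.2)]
y = [0, 0, 0, 0, 0, 1,
     1, 1, 1, 1, 1, 0]

flagged = flag_misclassified(X, y)
X_clean, y_clean = clean(X, y, flagged)
print(flagged)       # [5, 11]
print(len(X_clean))  # 10
```

Flagging on held-out folds rather than on the training fit is a design choice of this sketch: a flexible model can memorize mislabeled points it was trained on, which would hide exactly the instances the analysis is meant to expose. Note that removing every misclassified instance can also discard hard but correctly labeled cases, so the flagged set is best treated as candidates for review.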
