Parameter-Free Imputation for Imbalance Datasets

The class imbalance problem concerns improving classification accuracy on a minority class, while imputation is the process of replacing missing values. Traditionally, class imbalance and imputation have been treated as independent problems. In addition, the values that traditional methods fill in for minority-class instances are not adequate for imbalanced datasets. In this paper, we propose a new parameter-free imputation method for imbalanced datasets. For each missing value, it estimates a random value between the mean of the attribute containing the missing value and the value of that attribute in the record closest to the record with the missing value. Our proposed algorithm ignores the mean of instances to avoid over-fitting. Experimental results on imbalanced datasets show that our imputation outperforms other techniques when class imbalance measures are used.
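
The imputation rule described above can be illustrated with a minimal sketch. It is not the authors' implementation; the abstract does not specify the distance function, the sampling distribution, or how the attribute mean is computed, so the following are assumptions: the function name `impute_between_mean_and_nearest` is hypothetical, all attributes are numeric, the closest record is found with a root-mean-square Euclidean distance over the attributes observed in both records, the attribute mean is taken over all observed values, and the random value is drawn uniformly.

```python
import numpy as np


def impute_between_mean_and_nearest(X, rng=None):
    """Fill each NaN with a random value between the attribute mean and
    the value of that attribute in the nearest record (hypothetical sketch).

    Assumptions not stated in the abstract: numeric attributes only,
    distance computed over attributes observed in both records,
    uniform sampling, mean over all observed values of the attribute.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)  # per-attribute mean, ignoring NaN

    for i, j in zip(*np.nonzero(missing)):
        best_dist, best_val = np.inf, None
        for k in range(X.shape[0]):
            # A donor record must be a different row with attribute j observed.
            if k == i or missing[k, j]:
                continue
            shared = ~missing[i] & ~missing[k]
            if not shared.any():
                continue
            dist = np.sqrt(np.mean((X[i, shared] - X[k, shared]) ** 2))
            if dist < best_dist:
                best_dist, best_val = dist, X[k, j]

        if best_val is None:
            # No comparable donor record: fall back to the attribute mean.
            filled[i, j] = col_mean[j]
        else:
            low, high = sorted((col_mean[j], best_val))
            filled[i, j] = rng.uniform(low, high)
    return filled


# Usage: the second record is missing its second attribute.
data = [[1.0, 2.0],
        [1.1, np.nan],
        [5.0, 10.0]]
result = impute_between_mean_and_nearest(data, rng=np.random.default_rng(0))
```

In this usage example the nearest record to the second row is the first row, so the filled-in value lies between that record's value (2.0) and the attribute mean (6.0). No tunable parameter such as a neighbour count k is involved, which is consistent with the parameter-free claim.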
