Paper Information - Parameter-Free Imputation for Imbalance Datasets

Parameter-Free Imputation for Imbalance Datasets

Research on class imbalance aims to improve classification accuracy on the minority class, while imputation is the process of replacing missing values. Traditionally, the class imbalance and imputation problems are considered independently. In addition, the minority-class values filled in by traditional methods are not adequate for imbalanced datasets. In this paper, we propose a new parameter-free imputation method for imbalanced datasets that estimates each missing value as a random value between the mean of the attribute containing the missing value and the value of that attribute in the record closest to the record with the missing value. Our proposed algorithm ignores the mean of instances to avoid an over-fitting problem. Experimental results on imbalanced datasets show that our imputation method outperforms other techniques when class imbalance measures are used.
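
The imputation step described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The abstract does not specify the distance function, the sampling distribution, or how class labels enter the procedure, so the sketch assumes Euclidean distance over the attributes observed in both records, uniform sampling between the two bounds, numeric attributes only, and no use of class labels. The function name `impute_between_mean_and_nearest` and the use of `None` to mark missing values are hypothetical choices made for this example. It also assumes every attribute has at least one observed value.

```python
import math
import random


def impute_between_mean_and_nearest(records, rng=None):
    """Return a copy of records with each None replaced by a random value
    between the attribute mean and the nearest record's value.

    Assumptions (not stated in the abstract): Euclidean distance, uniform
    sampling, numeric attributes, class labels ignored.
    """
    rng = rng or random.Random()
    n_attrs = len(records[0])

    # Mean of each attribute, computed over observed values only.
    means = []
    for j in range(n_attrs):
        observed = [r[j] for r in records if r[j] is not None]
        means.append(sum(observed) / len(observed))

    def distance(a, b):
        # Euclidean distance over attributes observed in both records.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    filled = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not None:
                continue
            # Other records that have attribute j observed.
            candidates = [r for k, r in enumerate(records)
                          if k != i and r[j] is not None]
            nearest = min(candidates, key=lambda r: distance(rec, r))
            # Draw a value between the attribute mean and the neighbour's value.
            low, high = sorted((means[j], nearest[j]))
            filled[i][j] = rng.uniform(low, high)
    return filled


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, None], [9.0, 30.0]]
    print(impute_between_mean_and_nearest(data, random.Random(0)))
```

In the small example, the second attribute has mean 20.0 and the closest record to `[2.0, None]` is `[1.0, 10.0]`, so the filled value falls between 10.0 and 20.0. No neighbour count or other tuning parameter is needed, which is consistent with the parameter-free description.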

Chumphol Bunkhumpornpat | Jintana Takum
