Synthetic Over Sampling Methods for Handling Class Imbalanced Problems : A Review

Class imbalance is commonly found in real-world cases. It occurs when one class (the minority class) has fewer instances than another (the majority class). Imbalanced data is usually associated with misclassification: the minority class tends to be misclassified more often than the majority class. There are two approaches to solving imbalanced data problems: solutions at the data level and solutions at the algorithm level. Over-sampling is used more frequently than the other data-level methods. This study reviews synthetic over-sampling methods for handling the imbalanced data problem. Different methods produce synthetic data with different characteristics, so the choice of method must be adapted to the problem at hand, such as the level and pattern of imbalance in the available data. The review shows that no single method is absolutely more efficient in dealing with class imbalance. Rather, the difficulty of the class imbalance problem depends on the complexity of the data, the level of class imbalance, the size of the data, and the classifier involved. The choice of over-sampling strategy affects the outcome of the over-sampling, so there remains room to develop better over-sampling methods for handling class imbalance. The selection of the classifier and of the evaluation measures is important for obtaining the best results. A statistical testing approach is needed to assess the theoretical properties of synthetic data and to evaluate misclassification, in addition to the evaluation methods already in use.
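To make the idea of synthetic over-sampling concrete, the following is a minimal sketch in the style of SMOTE, one of the methods in this family: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, the parameters `n_new`, `k`, and `seed`, and the toy data are illustrative assumptions and do not come from the reviewed paper.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Hypothetical helper: create n_new synthetic minority points by
    interpolating between a minority sample and one of its k nearest
    minority neighbours (SMOTE-style, simplified)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # gap in [0, 1) sets the position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy data (assumed): 4 minority points against a majority class of size 10.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_size = 10

new_points = smote_sketch(minority, n_new=majority_size - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority))
```

Because every synthetic point is a convex combination of two existing minority points, the generated data stay inside the region already covered by the minority class. This is one reason different over-sampling strategies yield synthetic data with different characteristics.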
