Stratified Over-Sampling Bagging Method for Random Forests on Imbalanced Data

Imbalanced data presents a significant challenge to random forests (RF). Over-sampling is a commonly used sampling method for imbalanced data: it increases the number of minority-class instances to balance the class distribution. However, when only minority-class instances are replicated, such methods often produce highly correlated sample data sets, which reduces the generalization ability of RF. To solve this problem, we propose a stratified over-sampling bagging (SOB) method to generate training data sets for RF that are both balanced and diverse. We first cluster the training data set multiple times to produce multiple clustering results. The small individual clusters are then grouped according to their entropies. Next, we draw a set of training data sets from these groups of clusters using stratified sampling. Finally, the sampled training data sets are used to train RF. The data sets sampled with SOB are guaranteed to be balanced and diverse, which improves the performance of RF on imbalanced data. We have conducted a series of experiments whose results show that the proposed method outperforms several existing sampling methods.
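
The pipeline described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes k-means as the clustering algorithm and scikit-learn decision trees as the base learners, and the names build_strata, sob_sample, sob_forest, and sob_predict, along with parameters such as min_size and n_bins, are hypothetical.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def class_entropy(y):
    # Shannon entropy of the class distribution within a cluster.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def build_strata(labels, y, min_size=20, n_bins=3):
    # Large clusters stand alone as strata; small clusters are merged
    # with other small clusters whose class entropy falls in the same bin.
    max_ent = np.log2(max(len(np.unique(y)), 2))
    strata, small = [], {b: [] for b in range(n_bins)}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if len(idx) >= min_size:
            strata.append(idx)
        else:
            b = min(int(class_entropy(y[idx]) / max_ent * n_bins), n_bins - 1)
            small[b].append(idx)
    strata += [np.concatenate(g) for g in small.values() if g]
    return strata

def sob_sample(strata, y, per_class, rng):
    # Stratified over-sampling: each class is drawn up to per_class
    # instances, allocated across strata in proportion to its presence there.
    sample = []
    for cls in np.unique(y):
        total = (y == cls).sum()
        for idx in strata:
            members = idx[y[idx] == cls]
            n_draw = int(round(per_class * len(members) / total))
            if n_draw > 0:
                sample.append(rng.choice(members, size=n_draw, replace=True))
    return np.concatenate(sample)

def sob_forest(X, y, n_clusterings=5, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    per_class = counts.max()  # balance every class up to the majority size
    # Step 1: several clusterings of the data under different random seeds.
    clusterings = [KMeans(n_clusters=2 * len(classes), n_init=5,
                          random_state=s).fit_predict(X)
                   for s in range(n_clusterings)]
    # Step 2: group clusters into strata by size and entropy.
    strata_sets = [build_strata(lab, y) for lab in clusterings]
    # Steps 3-4: draw one balanced sample per tree and train the ensemble.
    forest = []
    for t in range(n_trees):
        idx = sob_sample(strata_sets[t % n_clusterings], y, per_class, rng)
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def sob_predict(forest, X):
    # Majority vote over the trees; assumes integer class labels.
    votes = np.stack([t.predict(X) for t in forest]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

In this sketch, diversity comes from two sources: each tree cycles through a different clustering, and each stratified draw is an independent bootstrap, so no two trees see the same balanced sample.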
