Distribution based ensemble for class imbalance learning

The MultiBoost ensemble is well acknowledged as an effective learning algorithm that reduces both the bias and variance components of error and achieves high generalization performance. However, to handle class imbalanced learning, MultiBoost must be adapted. In this paper, a new hybrid machine learning method called Distribution based MultiBoost (DBMB) is proposed for class imbalanced problems; it combines distribution based balanced sampling with the MultiBoost algorithm to achieve better minority class performance. DBMB minimizes within-class and between-class imbalance by learning and sampling from different distributions (Gaussian and Poisson), and reduces bias and variance in error by employing the MultiBoost ensemble. Therefore, DBMB outputs a final strong learner that is a more proficient ensemble of weak base learners for imbalanced data sets. We show that the G-mean, F1 measure and AUC of DBMB are significantly superior to those of competing methods. Experimental verification shows that the proposed DBMB outperforms other state-of-the-art algorithms on many real world class imbalanced problems. Furthermore, the proposed method is more scalable than other boosting methods.
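
The abstract describes DBMB only at a high level, so the following is a minimal, hypothetical sketch of the general idea: balance the training data by drawing synthetic minority examples from a fitted Gaussian, then train several AdaBoost committees on Poisson-reweighted (wagging-style) copies of the balanced data, in the spirit of MultiBoost. The class name DistributionBalancedMultiBoost, the use of scikit-learn's AdaBoostClassifier, and the specific roles assigned to the Gaussian and Poisson distributions are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): Gaussian minority oversampling
# followed by wagging-style AdaBoost committees, assuming binary labels {0, 1}.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier


def gaussian_oversample(X_min, n_new, rng):
    """Draw n_new synthetic minority samples from a Gaussian fitted to X_min."""
    mean = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False) + 1e-6 * np.eye(X_min.shape[1])
    return rng.multivariate_normal(mean, cov, size=n_new)


class DistributionBalancedMultiBoost:
    """Illustrative DBMB-style ensemble: balanced sampling + boosted committees."""

    def __init__(self, n_committees=5, n_estimators=20, random_state=0):
        self.n_committees = n_committees
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        # Between-class balancing: add Gaussian-sampled minority points.
        X_syn = gaussian_oversample(X[y == minority], counts.max() - counts.min(), rng)
        X_bal = np.vstack([X, X_syn])
        y_bal = np.concatenate([y, np.full(len(X_syn), minority)])

        self.committees_ = []
        for _ in range(self.n_committees):
            # Wagging-style reweighting: Poisson(1) weight per training example.
            weights = rng.poisson(lam=1.0, size=len(y_bal)).astype(float) + 1e-3
            committee = AdaBoostClassifier(
                n_estimators=self.n_estimators,
                random_state=int(rng.integers(1 << 31)),
            )
            committee.fit(X_bal, y_bal, sample_weight=weights)
            self.committees_.append(committee)
        return self

    def predict(self, X):
        # Majority vote of the committees (labels assumed to be 0/1).
        votes = np.stack([c.predict(X) for c in self.committees_])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```

In this sketch the Gaussian handles between-class imbalance by generating minority points, while the Poisson(1) draws approximate wagging's per-example reweighting across committees; the paper's actual sampling and weighting scheme may differ. A quick interface check could use, for example, X, y = sklearn.datasets.make_classification(weights=[0.9, 0.1]) followed by DistributionBalancedMultiBoost().fit(X, y).predict(X).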
