Ensemble Approach for the Classification of Imbalanced Data

Ensembles are often capable of greater prediction accuracy than any of their individual members, and the diversity among the individual base learners makes an ensemble less prone to overfitting. In many applications, however, the data are imbalanced, and a classifier built on all of the data tends to ignore the minority class. As a solution, we propose constructing a large number of relatively small, balanced subsets, each containing a randomly selected sample of representatives from the larger class. The system produces a matrix of linear regression coefficients whose rows correspond to the random subsets and whose columns correspond to the features. From this matrix we assess how stable the influence of each feature is, and we propose keeping in the model only the features with stable influence. The final model is an average of the base learners, which need not be linear regressions. We present test results on the datasets of the PAKDD-2007 data-mining competition.
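
The following is a minimal sketch of the pipeline described above: random balanced subsets, a coefficient matrix with one row per subset, a stability screen on the features, and an averaged ensemble. The synthetic dataset, the number of subsets (100), the stability threshold (1.0), and the use of linear regression throughout are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the random-balanced-subsets ensemble. Data, subset count, and
# threshold below are assumptions for illustration (not the PAKDD-2007 data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 2000 majority (y=0) vs. 100 minority (y=1)
# examples; only the first 5 of 20 features carry signal.
n_features = 20
X = rng.normal(size=(2100, n_features))
y = np.concatenate([np.zeros(2000), np.ones(100)])
X[y == 1, :5] += 0.8

min_idx = np.where(y == 1)[0]
maj_idx = np.where(y == 0)[0]

def balanced_subset():
    """All minority examples plus an equal-sized random majority sample."""
    sample = rng.choice(maj_idx, size=len(min_idx), replace=False)
    return np.concatenate([min_idx, sample])

# Coefficient matrix: rows correspond to random subsets, columns to features.
n_subsets = 100
coef = np.empty((n_subsets, n_features))
for k in range(n_subsets):
    idx = balanced_subset()
    coef[k] = LinearRegression().fit(X[idx], y[idx]).coef_

# Stability of each feature's influence: |mean| / std of its coefficient
# across subsets. A feature whose coefficient flips sign from subset to
# subset scores low and is dropped.
stability = np.abs(coef.mean(axis=0)) / (coef.std(axis=0) + 1e-12)
keep = stability > 1.0  # assumed threshold

# Final model: average of base learners refitted on the stable features.
# Linear regression is used here for simplicity; the paper notes the base
# learner need not be a linear regression.
scores = np.zeros(len(X))
for k in range(n_subsets):
    idx = balanced_subset()
    model = LinearRegression().fit(X[idx][:, keep], y[idx])
    scores += model.predict(X[:, keep])
scores /= n_subsets

print("kept features:", np.flatnonzero(keep))
```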
