Paper Information - Classification of Imbalanced Marketing Data with Balanced Random Sets

Classification of Imbalanced Marketing Data with Balanced Random Sets

With imbalanced data, a classifier built on all of the data tends to ignore the minority class. To overcome this problem, we propose an ensemble classifier constructed from a large number of relatively small, balanced subsets, in which representatives of both classes are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients whose rows represent the random subsets and whose columns represent the features. Based on this matrix, we assess how stable the influence of each feature is, and we propose to keep in the model only features with stable influence. The final model is an average of the base learners and is not necessarily a linear regression. Proper data pre-processing is very important for the effectiveness of the whole system, and we propose to reduce the original data to the simplest binary sparse format, which is particularly convenient for the construction of decision trees. As a result, each feature is represented by several binary variables, or bins, which are all equivalent in terms of data structure. This property is very important and may be used for feature selection. The proposed method exploits not only the contributions of particular variables to the base learners, but also the diversity of those contributions. Test results on the KDD-2009 competition datasets are presented.
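The balanced-random-sets idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the synthetic data, the subset sizes, and in particular the stability score (absolute mean of a coefficient divided by its standard deviation across subsets, with a threshold of 2.0) are assumptions made for illustration; the abstract only says that the stability of each feature's influence is assessed from the coefficient matrix.

```python
import numpy as np

def fit_balanced_random_sets(X, y, n_sets=200, subset_size=50, seed=0):
    """Fit one least-squares linear model per balanced random subset.

    Returns W with shape (n_sets, n_features + 1): rows are random subsets,
    columns are features, and the last column is the intercept.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)  # minority class indices
    neg = np.flatnonzero(y == 0)  # majority class indices
    k = min(subset_size, len(pos), len(neg))
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    W = np.empty((n_sets, Xb.shape[1]))
    for i in range(n_sets):
        # Draw the same number of examples from each class.
        idx = np.concatenate([
            rng.choice(pos, k, replace=False),
            rng.choice(neg, k, replace=False),
        ])
        target = 2.0 * y[idx] - 1.0  # encode classes as -1 / +1
        W[i] = np.linalg.lstsq(Xb[idx], target, rcond=None)[0]
    return W

def stable_features(W, threshold=2.0):
    """Assumed stability score: |mean| / std of each coefficient over subsets."""
    coef = W[:, :-1]  # drop the intercept column
    ratio = np.abs(coef.mean(axis=0)) / (coef.std(axis=0) + 1e-12)
    return ratio >= threshold, ratio

def ensemble_scores(W, X):
    """Average of the linear base learners (equals the mean-coefficient model)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W.mean(axis=0)

# Synthetic imbalanced data (hypothetical): 10 features, only the first two
# carry signal, and roughly 5% of the examples are positive.
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 10))
signal = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=5000)
y = (signal > np.quantile(signal, 0.95)).astype(int)

W = fit_balanced_random_sets(X, y)
keep, ratio = stable_features(W)

# Refit on the retained features only, then average the base learners.
W_sel = fit_balanced_random_sets(X[:, keep], y, seed=1)
scores = ensemble_scores(W_sel, X[:, keep])
print("retained feature indices:", np.flatnonzero(keep))
```

Averaging linear base learners yields another linear model, which is why the sketch can average coefficients directly; with nonlinear base learners the predictions themselves would be averaged instead.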

Geoffrey J. McLachlan | Vladimir Nikulin
