Application of parallel distributed genetics-based machine learning to imbalanced data sets

Real-world data sets are often imbalanced with respect to the class distribution. Designing classifiers from such data sets is a relatively new challenge. The main problem is the scarcity of positive (minority) class patterns in the data sets. There are two main approaches to this problem. One is to sample additional minority class patterns (i.e., over-sampling). The other is to sample only a part of the majority class patterns (i.e., under-sampling). In our previous research, we proposed a parallel distributed genetics-based machine learning method for large data sets. In that method, both the population and the training data set are divided into subgroups. Each pair of a sub-population and a training data subset is assigned to an individual CPU core in order to reduce the computation time. In this paper, our parallel distributed approach is applied to imbalanced data sets. Each training data subset is constructed by combining one subset of the divided majority class patterns with the entire, undivided set of minority class patterns. Through computational experiments, we show the effectiveness of our parallel distributed approach with the proposed data subdivision schemes for imbalanced data sets.
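The data subdivision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_training_subsets`, the random round-robin split of the majority class, and the toy labels are all assumptions made for illustration. It only shows the idea of dividing the majority class patterns while giving every subset the full set of minority class patterns.

```python
import random


def build_training_subsets(labels, minority_label, n_subsets, seed=0):
    """Return one list of pattern indices per subset (hypothetical helper).

    The majority class indices are shuffled and dealt round-robin into
    n_subsets disjoint groups; every group is then combined with ALL
    minority class indices, so the minority class is never divided.
    """
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(majority)

    # Disjoint, near-equal-size parts of the majority class.
    majority_parts = [majority[k::n_subsets] for k in range(n_subsets)]

    # Each training subset = one majority part + the entire minority class.
    return [sorted(part + minority) for part in majority_parts]


# Toy example: 3 minority patterns (label 1) and 12 majority patterns (label 0).
labels = [1] * 3 + [0] * 12
subsets = build_training_subsets(labels, minority_label=1, n_subsets=4)
```

In a parallel distributed setting, each of these index lists would be paired with one sub-population and assigned to one CPU core. Note that the class ratio inside each subset is less skewed than in the full data set, because the minority class is replicated while the majority class is split.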
