Distributed Methodologies for Imbalanced Classification Problems: Parameter Analysis and Tuning

Imbalanced classification problems remain a challenge in data mining research because classifiers often fail to produce sufficiently good models under class imbalance. We have previously proposed a general methodology for improving classifier performance under imbalance: Evolutionary Cost-Sensitive Balancing (ECSB). This paper provides an empirical analysis of a distributed approach to ECSB (dECSB). The influence of the number of splits on the quality of the output classification model is studied on several data sets, with J4.8 as the base classifier. The data sets have been partitioned according to the imbalance ratio and the instances-per-attribute ratio. We found that the appropriate number of splits is highly dependent on the problem at hand; however, both of these factors do influence it. The effect of altering the genetic settings has also been investigated, in an attempt to identify values that consistently yield good results. Again, we found the results to be highly dependent on the problem, with some data sets exhibiting only small performance variations across genetic settings.
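
As an illustration of the two factors used to partition the data sets, the sketch below computes an imbalance ratio and an instances-per-attribute ratio for a toy data set. The abstract does not define either quantity, so the definitions here are assumptions: the imbalance ratio is taken as the majority-class count divided by the minority-class count, and the instances-per-attribute ratio as the number of instances divided by the number of attributes. The function names and the toy data are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


def instances_per_attribute(n_instances, n_attributes):
    """Number of instances divided by number of attributes (assumed definition)."""
    if n_attributes <= 0:
        raise ValueError("n_attributes must be positive")
    return n_instances / n_attributes


# Hypothetical toy data set: 90 negative and 10 positive instances, 5 attributes.
labels = ["neg"] * 90 + ["pos"] * 10
ir = imbalance_ratio(labels)
iar = instances_per_attribute(len(labels), 5)
print(f"imbalance ratio = {ir}, instances per attribute = {iar}")
```

Under these assumed definitions, a higher imbalance ratio means a rarer minority class, and a lower instances-per-attribute ratio means fewer examples are available per feature.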
