A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets

Highlights:
We compare OUPS and Safe Level OUPS against popular SMOTE generalizations.
Safe Level OUPS resulted in the highest sensitivity and g-mean.
The OUPS modification performed moderately well within neural networks.
Safe Level OUPS improves prediction of noisy minority members using linear SVM.

Building accurate classifiers for predicting group membership is difficult when the data are skewed or imbalanced, as is typical of real-world data sets. As a result, the classifier tends to be biased towards the over-represented, or majority, group. Re-sampling techniques offer simple approaches that can be used to minimize this effect. Over-sampling methods aim to combat class imbalance by increasing the number of minority group samples, also referred to as members of the minority group. Over the last decade SMOTE-based methods have been used and extended to overcome this problem, yet there has been little emphasis on improving this approach with consideration of data-intrinsic properties beyond class imbalance alone. In this paper we introduce modifications to the a priori based methods OUPS and Safe Level OUPS that yield improvements in sensitivity measures over competing SMOTE-based approaches such as the local neighborhood extension to SMOTE (LN-SMOTE), Borderline-SMOTE and Safe-Level-SMOTE.
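For readers unfamiliar with the interpolation step that all of these over-sampling methods share, the following is a minimal sketch in Python/NumPy of SMOTE-style synthetic sample generation; the function name and parameters are our own, not from any of the cited papers:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """SMOTE-style interpolation: each synthetic point lies on the
    segment between a minority sample and one of its k nearest
    minority neighbours. Assumes len(X_min) > k."""
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    # pairwise squared Euclidean distances among minority samples
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]    # k nearest minority neighbours

    synthetic = np.empty((n_new, d))
    for i in range(n_new):
        j = rng.integers(n)                  # pick a seed minority sample
        nb = X_min[nn[j, rng.integers(k)]]   # and one of its neighbours
        gap = rng.random()                   # interpolation weight in [0, 1]
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic
```

The "a priori" distinction of OUPS [43] is that neighbours are not found by a per-point nearest-neighbour search; instances are ordered once by an estimated propensity score, and minority samples are interpolated with rank-adjacent instances. The sketch below is a rough illustration of that idea under our reading of [43]; the exact propensity model and neighbour selection in the paper may differ:

```python
from sklearn.linear_model import LogisticRegression

def oups_sample(X, y, minority_label=1, per_sample=3, rng=None):
    """Illustrative OUPS-style over-sampling: order instances by
    propensity score, then interpolate each minority sample with the
    instances that follow it in that ordering. Assumes labels {0, 1}
    with 1 the minority class."""
    rng = np.random.default_rng(rng)
    # estimated propensity of minority-class membership
    score = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    order = np.argsort(score)
    X_ord, y_ord = X[order], y[order]

    synthetic = []
    for i in np.flatnonzero(y_ord == minority_label):
        for step in range(1, per_sample + 1):
            j = min(i + step, len(X_ord) - 1)  # a rank-adjacent instance
            gap = rng.random()                 # interpolation weight in [0, 1]
            synthetic.append(X_ord[i] + gap * (X_ord[j] - X_ord[i]))
    return np.asarray(synthetic)
```

Because the ordering is computed once up front, this avoids the repeated neighbourhood searches that SMOTE-based methods require, which is what makes the approach a priori.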

[1] Chee Khiang Pang, et al. Kernel-based SMOTE for SVM classification of imbalanced datasets, 2015, IECON 2015 - 41st Annual Conference of the IEEE Industrial Electronics Society.

[2] José Salvador Sánchez, et al. Surrounding neighborhood-based SMOTE for learning from imbalanced data sets, 2012, Progress in Artificial Intelligence.

[3] Francisco Herrera, et al. SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, 2015, Inf. Sci.

[4] Tomasz Maciejewski, et al. Local neighbourhood extension of SMOTE for mining imbalanced data, 2011, 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM).

[5] Nitesh V. Chawla, et al. SMOTE: Synthetic Minority Over-sampling Technique, 2002, J. Artif. Intell. Res.

[6] N. Japkowicz. Learning from Imbalanced Data Sets: A Comparison of Various Strategies, 2000.

[7] D. Rubin, et al. The central role of the propensity score in observational studies for causal effects, 1983.

[8] Gustavo E. A. P. A. Batista, et al. Learning with Class Skews and Small Disjuncts, 2004, SBIA.

[9] Chidchanok Lursinsap, et al. Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques, 2013, Pattern Recognit. Lett.

[10] Yan-Ping Zhang, et al. Cluster-based majority under-sampling approaches for class imbalance learning, 2010, 2010 2nd IEEE International Conference on Information and Financial Engineering.

[11] Nicolás García-Pedrajas, et al. A scalable method for instance selection for class-imbalance datasets, 2011, 2011 11th International Conference on Intelligent Systems Design and Applications.

[12] D. Rubin, et al. Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score, 1985.

[13] Xiaoyi Jiang, et al. Dynamic classifier ensemble model for customer classification with imbalanced class distribution, 2012, Expert Syst. Appl.

[14] R. D'Agostino. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group, 2005, Statistics in medicine.

[15] P. Austin. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies, 2011, Multivariate behavioral research.

[16] G. Imbens, et al. Matching on the Estimated Propensity Score, 2009.

[17] William A. Rivera, et al. Safe level OUPS for improving target concept learning in imbalanced data sets, 2015, SoutheastCon 2015.

[18] Zhi-Hua Zhou, et al. The Influence of Class Imbalance on Cost-Sensitive Learning: An Empirical Study, 2006, Sixth International Conference on Data Mining (ICDM'06).

[19] Mark H. Johnson, et al. The development of spatial frequency biases in face recognition, 2010, Journal of experimental child psychology.

[20] Francisco Herrera, et al. On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed, 2014, Inf. Sci.

[21] Hui Han, et al. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, 2005, ICIC.

[22] Zhengding Qiu, et al. The effect of imbalanced data sets on LDA: A theoretical and empirical analysis, 2007, Pattern Recognit.

[23] Francisco Herrera, et al. Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling, 2011, Soft Comput.

[24] Francisco Herrera, et al. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, 2013, Inf. Sci.

[25] Francisco Herrera, et al. Managing Borderline and Noisy Examples in Imbalanced Classification by Combining SMOTE with Ensemble Filtering, 2014, IDEAL.

[26] Francisco Herrera, et al. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power, 2010, Inf. Sci.

[27] Che-Chang Hsu, et al. Bayesian decision theory for support vector machines: Imbalance measurement and feature optimization, 2011, Expert Syst. Appl.

[28] Wu Qingfeng, et al. An empirical study on ensemble selection for class-imbalance data sets, 2010, 2010 5th International Conference on Computer Science & Education.

[29] Luís Torgo, et al. SMOTE for Regression, 2013, EPIA.

[30] Francisco Herrera, et al. Addressing the Classification with Imbalanced Data: Open Problems and New Challenges on Class Distribution, 2011, HAIS.

[31] Pierre Baldi, et al. Assessing the accuracy of prediction algorithms for classification: an overview, 2000, Bioinform.

[32] José Salvador Sánchez, et al. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance, 2012, Knowl. Based Syst.

[33] Gongping Yang, et al. On the Class Imbalance Problem, 2008, 2008 Fourth International Conference on Natural Computation.

[34] Janez Demsar, et al. Statistical Comparisons of Classifiers over Multiple Data Sets, 2006, J. Mach. Learn. Res.

[35] Mohammad Mansour Riahi Kashani, et al. Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset, 2013, ArXiv.

[36] Chumphol Bunkhumpornpat, et al. Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem, 2009, PAKDD.

[37] Dominik Papies, et al. The Cost Impact of Spam Filters: Measuring the Effect of Information System Technologies in Organizations, 2008, Inf. Syst. Res.

[38] Wen-Chin Chen, et al. Increasing the effectiveness of associative classification in terms of class imbalance by using a novel pruning algorithm, 2012, Expert Syst. Appl.

[39] Lior Rokach, et al. Data Mining and Knowledge Discovery Handbook, 2005.

[40] Herna L. Viktor, et al. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach, 2004, SKDD.

[41] Gustavo E. A. P. A. Batista, et al. A study of the behavior of several methods for balancing machine learning training data, 2004, SKDD.

[42] Gary M. Weiss. The Impact of Small Disjuncts on Classifier Learning, 2010, Data Mining.

[43] Amit Goel, et al. OUPS: A Combined Approach Using SMOTE and Propensity Score Matching, 2014, 2014 13th International Conference on Machine Learning and Applications.

[44] Yang Wang, et al. Boosting for Learning Multiple Classes with Imbalanced Class Distribution, 2006, Sixth International Conference on Data Mining (ICDM'06).

[45] Dae-Ki Kang, et al. Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction, 2015, Expert Syst. Appl.

[46] Lars Niklasson, et al. Genetically Evolved Nearest Neighbor Ensembles, 2009.

[47] Bianca Zadrozny, et al. Learning and making decisions when costs and probabilities are both unknown, 2001, KDD '01.

[48] Taeho Jo, et al. A Multiple Resampling Method for Learning from Imbalanced Data Sets, 2004, Comput. Intell.

[49] R. D'Agostino. Adjustment Methods: Propensity Score Methods for Bias Reduction in the Comparison of a Treatment to a Non-Randomized Control Group, 2005.

[50] Hong Gu, et al. Imbalanced classification using support vector machine ensemble, 2011, Neural Computing and Applications.

[51] Haibo He, et al. Learning from Imbalanced Data, 2009, IEEE Transactions on Knowledge and Data Engineering.

[52] Pedro M. Domingos. MetaCost: a general method for making classifiers cost-sensitive, 1999, KDD '99.

[53] Nitesh V. Chawla, et al. Learning from Imbalanced Data: Evaluation Matters, 2012.