Predicting Classifiers Efficacy in Relation with Data Complexity Metric Using Under-Sampling Techniques

In imbalanced classification tasks, the training datasets may also suffer from other problems such as class overlap, small disjuncts, and low-density classes. In such situations, learning for the minority class is imprecise. Data complexity metrics help identify the relationship between a classifier's learning accuracy and dataset characteristics. This paper presents an experimental study on imbalanced datasets in which the dwCM complexity metric is used to group the datasets by complexity level; the behavior of under-sampling-based pre-processing techniques is then analyzed for each group. Experiments are conducted on 22 real-life datasets with different levels of imbalance, class overlap, and class density. The experimental results show that the groups formed using the dwCM metric can better explain the difficulty of imbalanced datasets and help predict how classifiers respond to the under-sampling algorithms.
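For readers unfamiliar with under-sampling as a pre-processing step, the general idea can be sketched as follows. This is a minimal illustration of plain random under-sampling only; the abstract does not name the specific under-sampling techniques studied, and the function name `random_under_sample`, its parameters, and the toy data are hypothetical choices made for this sketch, not the paper's method.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly discard samples from larger classes until every class
    has as many samples as the smallest class (illustrative only)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())  # size of the smallest class

    keep = []
    for label in counts:
        # Indices of all samples belonging to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Keep a random subset of exactly `target` samples.
        keep.extend(rng.sample(idx, target))

    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy imbalanced dataset: five majority samples (0), one minority sample (1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]
X_res, y_res = random_under_sample(X, y)
```

After resampling, both classes have the same number of samples, and the minority sample is always retained because only the larger class is reduced.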
