EMDID: Evolutionary multi-objective discretization for imbalanced datasets

Abstract: In recent years, imbalanced dataset classification has received significant attention because of its application to real-world problems, resulting in the emergence of a new class of algorithms. Classification algorithms that work with discretized data have been shown to yield better performance, so discretization is often a critical data preprocessing technique. In this paper, a novel Evolutionary Multi-objective Discretization algorithm for binary Imbalanced Datasets (EMDID) is presented. The proposed algorithm uses evolutionary multi-objective optimization to simultaneously optimize three objective functions: (1) the area under the ROC curve (AUC); (2) the number of cut points; and (3) low-frequency cut points. The first objective function uses AUC, instead of classification accuracy, to choose cut points that better identify the minority class. The second objective function reduces the number of cut points, while the third selects low-frequency cut points so that the information loss caused by discretizing continuous data is minimized. To evaluate the proposed algorithm, a total of 25 imbalanced benchmark datasets are used, and the results are compared with those of popular algorithms in the literature such as Class-Attribute Interdependence Maximization (CAIM) and Evolutionary Multi-objective Discretization (EMD). Our findings indicate that the proposed algorithm outperforms the other techniques in terms of the number of cut points and AUC, as supported by non-parametric statistical tests.
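To make the three objectives concrete, the following is a minimal sketch of how a candidate set of cut points for a single attribute might be scored and compared. It is an illustration under stated assumptions, not the EMDID implementation: the function names (`discretize`, `auc`, `objectives`, `dominates`), the per-bin minority-rate scorer used in place of a real classifier, the `width` window used to define a cut point's frequency, and the toy data are all hypothetical.

```python
from bisect import bisect_right


def discretize(values, cuts):
    """Map each continuous value to the index of the interval it falls in."""
    cuts = sorted(cuts)
    return [bisect_right(cuts, v) for v in values]


def auc(scores, labels):
    """AUC via the Mann-Whitney statistic; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def objectives(values, labels, cuts, width=0.5):
    """Return (AUC, number of cut points, total frequency near the cuts)."""
    bins = discretize(values, cuts)
    # Assumption: a stand-in scorer that rates each bin by its
    # minority-class (label 1) fraction, instead of a trained classifier.
    totals, minority = {}, {}
    for b, y in zip(bins, labels):
        totals[b] = totals.get(b, 0) + 1
        minority[b] = minority.get(b, 0) + y
    scores = [minority[b] / totals[b] for b in bins]
    # Assumption: a cut's frequency is the number of samples within
    # `width` of it; cuts placed in sparse regions score low.
    freq = sum(sum(abs(v - c) <= width for v in values) for c in cuts)
    return auc(scores, labels), len(cuts), freq


def dominates(a, b):
    """Pareto dominance: maximize AUC, minimize cut count and frequency."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    return no_worse and a != b


# Toy imbalanced attribute: three minority samples (label 1) out of ten.
values = [1.0, 1.1, 1.2, 1.3, 3.0, 3.1, 3.2, 7.0, 7.1, 7.2]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

sparse = objectives(values, labels, [5.0])             # one cut in an empty gap
dense = objectives(values, labels, [1.15, 3.1, 5.0])   # extra cuts inside clusters
print(sparse, dense, dominates(sparse, dense))
```

On this toy data both candidates separate the minority class equally well, but the single cut placed in the sparse gap uses fewer cut points and touches no samples, so it Pareto-dominates the three-cut candidate. A multi-objective evolutionary search would retain such non-dominated candidates rather than collapsing the three objectives into one weighted score.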
