PBC4cip: A new contrast pattern-based classifier for class imbalance problems

Abstract: Contrast pattern-based classifiers are an important family of classifiers that are both understandable and accurate. Nevertheless, these classifiers do not perform well on class imbalance problems. In this paper, we introduce a new contrast pattern-based classifier for class imbalance problems. Our proposal combines the support of the patterns with the class imbalance level at the classification stage of the classifier. Our experimental results on highly imbalanced databases show that the proposed classifier significantly outperforms the current contrast pattern-based classifiers designed for class imbalance problems. Additionally, we show that our classifier significantly outperforms other state-of-the-art classifiers that are not directly based on contrast patterns but are also designed to deal with class imbalance problems.
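As a rough illustration of what combining pattern support with the class imbalance level at classification time could look like, the sketch below sums the supports of the patterns matched by an instance and scales each class total by the inverse of that class's frequency. This is an assumption-laden toy and not the weighting formula defined in the paper: the function name `classify`, the `(predicate, label, support)` pattern representation, and the inverse-frequency weight are all hypothetical choices made for this example.

```python
from collections import defaultdict


def classify(instance, patterns, class_counts):
    """Toy support-based vote with an imbalance correction (illustrative only).

    patterns: iterable of (predicate, class_label, support) tuples, where
        predicate(instance) returns True if the pattern covers the instance.
    class_counts: dict mapping class_label -> number of training objects.
    Returns the winning class label, or None if no pattern matches.
    """
    total = sum(class_counts.values())
    scores = defaultdict(float)

    # Aggregate the support of every pattern that covers the instance.
    for predicate, label, support in patterns:
        if predicate(instance):
            scores[label] += support

    if not scores:
        return None

    # Hypothetical imbalance correction: scale each class score by the
    # inverse of the class frequency, so rare classes are not drowned out.
    weighted = {
        label: score * total / class_counts[label]
        for label, score in scores.items()
    }
    return max(weighted, key=weighted.get)


# Two hand-made patterns over a 95/5 imbalanced training set.
class_counts = {"majority": 95, "minority": 5}
patterns = [
    (lambda x: x["a"] > 1, "majority", 0.6),
    (lambda x: x["b"] == "y", "minority", 0.4),
]

prediction = classify({"a": 2, "b": "y"}, patterns, class_counts)
```

In this toy setting a plain sum of supports would favor the majority class (0.6 against 0.4), whereas the frequency-scaled vote favors the minority class. That shift is the general effect the abstract describes, although the actual weighting scheme is the one specified in the paper.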
