On the suitability of resampling techniques for the class imbalance problem in credit scoring

In real-life credit scoring applications, the case in which the class of defaulters is under-represented in comparison with the class of non-defaulters is a very common situation, but it has still received little attention. The present paper investigates the suitability and performance of several resampling techniques when applied in conjunction with statistical and artificial intelligence prediction models over five real-world credit data sets, which have artificially been modified to derive different imbalance ratios (proportion of defaulters and non-defaulters examples). Experimental results demonstrate that the use of resampling methods consistently improves the performance given by the original imbalanced data. Besides, it is also important to note that in general, over-sampling techniques perform better than any under-sampling approach.

[1]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[2]  Clark R. Abrahams,et al.  Fair Lending Compliance: Intelligence and Implications for Credit Risk Management , 2008 .

[3]  Dirk Tasche,et al.  Estimating Probabilities of Default for Low Default Portfolios , 2004 .

[4]  Bart Baesens,et al.  Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings , 2008, IEEE Transactions on Software Engineering.

[5]  R. Barandelaa,et al.  Strategies for learning in class imbalance problems , 2003, Pattern Recognit..

[6]  H. Keselman,et al.  Multiple Comparison Procedures , 2005 .

[7]  Yue Wang,et al.  Measuring Scorecard Performance , 2004, International Conference on Computational Science.

[8]  H. Sabzevari,et al.  A comparison between statistical and Data Mining methods for credit scoring in case of limited available data , 2007 .

[9]  Yue-Shi Lee,et al.  Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset , 2006 .

[10]  Lei Yang,et al.  Customer Credit Scoring Method Based on the SVDD Classification Model with Imbalanced Dataset , 2010, CETS.

[11]  Francisco Herrera,et al.  Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power , 2010, Inf. Sci..

[12]  Soushan Wu,et al.  Credit rating analysis with support vector machines and neural networks: a market comparative study , 2004, Decis. Support Syst..

[13]  Qi Fei,et al.  A comparative study of data mining methods in consumer loans credit scoring management , 2006 .

[14]  David A. Cieslak,et al.  Automatically countering imbalance and its empirical relationship to cost , 2008, Data Mining and Knowledge Discovery.

[15]  Hewijin Christine Jiau,et al.  Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem , 2006 .

[16]  Siddhartha Bhattacharyya,et al.  Data mining for credit card fraud: A comparative study , 2011, Decis. Support Syst..

[17]  C. G. Hilborn,et al.  The Condensed Nearest Neighbor Rule , 1967 .

[18]  David J. Hand,et al.  Choosing k for two-class nearest neighbour classifiers with unbalanced classes , 2003, Pattern Recognit. Lett..

[19]  D. Hand,et al.  Scorecard construction with unbalanced class sizes , 2003 .

[20]  A. Tamhane,et al.  Multiple Comparison Procedures , 1989 .

[21]  I. Tomek,et al.  Two Modifications of CNN , 1976 .

[22]  Christophe Mues,et al.  An experimental comparison of classification algorithms for imbalanced credit scoring data sets , 2012, Expert Syst. Appl..

[23]  Jan Vanthienen,et al.  50 years of data mining and OR: upcoming trends and challenges , 2009, J. Oper. Res. Soc..

[24]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[25]  David J. Hand,et al.  Construction of a k-nearest-neighbour credit-scoring system , 1997 .

[26]  Hussein A. Abdou,et al.  Credit Scoring, Statistical Techniques and Evaluation Criteria: A Review of the Literature , 2011, Intell. Syst. Account. Finance Manag..

[27]  Xinzhu Yang,et al.  Solving Credit Scoring Problem with Ensemble Learning: A Case Study , 2009, 2009 Second International Symposium on Knowledge Acquisition and Modeling.

[28]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[29]  Peter E. Hart,et al.  The condensed nearest neighbor rule (Corresp.) , 1968, IEEE Trans. Inf. Theory.

[30]  Tomasz Maciejewski,et al.  Local neighbourhood extension of SMOTE for mining imbalanced data , 2011, 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM).

[31]  Bernd Engelmann,et al.  The Basel II risk parameters : estimation, validation, and stress testing , 2006 .

[32]  Kenneth Kennedy,et al.  Learning without Default: A Study of One-Class Classification and the Low-Default Portfolio Problem , 2009, AICS.

[33]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[34]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[35]  Jian Ma,et al.  A comparative assessment of ensemble learning for credit scoring , 2011, Expert Syst. Appl..

[36]  Johan A. K. Suykens,et al.  Benchmarking state-of-the-art classification algorithms for credit scoring , 2003, J. Oper. Res. Soc..

[37]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[38]  Chumphol Bunkhumpornpat,et al.  Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem , 2009, PAKDD.

[39]  Guy Lapalme,et al.  A systematic analysis of performance measures for classification tasks , 2009, Inf. Process. Manag..

[40]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[41]  Ping Yao Comparative Study on Class Imbalance Learning for Credit Scoring , 2009, 2009 Ninth International Conference on Hybrid Intelligent Systems.

[42]  María José del Jesús,et al.  KEEL: a software tool to assess evolutionary algorithms for data mining problems , 2008, Soft Comput..

[43]  Tom Fawcett,et al.  Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions , 1997, KDD.

[44]  D. J. Hand,et al.  Good practice in retail credit scorecard assessment , 2005, J. Oper. Res. Soc..

[45]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[46]  Jonathan N. Crook,et al.  Credit Scoring and Its Applications , 2002, SIAM monographs on mathematical modeling and computation.