An Investigation of Credit Card Default Prediction in the Imbalanced Datasets

As the financial industry has grown rapidly, credit risk has become a prominent threat to commercial banks. One of the biggest challenges these banks face is predicting the default risk of credit clients. Recent studies mostly focus on enhancing classifier performance for credit card default prediction rather than on building an interpretable model. In classification problems, handling an imbalanced dataset is also crucial to improving model performance, because most cases lie in one class and only a few examples fall in the other classes. Traditional statistical approaches are not well suited to imbalanced data. In this study, a model is developed for credit default prediction using several credit-related datasets. Because the minimum and maximum values often differ substantially across features, Min-Max normalization is used to scale the features to a common range. Data-level resampling techniques, including various undersampling and oversampling methods, are employed to address the class imbalance. Different machine learning models are also employed to obtain efficient results. We test two hypotheses: whether models built with different machine learning techniques differ significantly, and whether resampling techniques significantly improve the performance of the proposed models. One-way Analysis of Variance (ANOVA), a hypothesis-testing technique, is used to test the significance of the results. The results are validated with a split method in which the data are divided into training and test sets. On the imbalanced datasets, the accuracy is 66.9% on the Taiwan clients credit dataset, 70.7% on the South German clients credit dataset, and 65% on the Belgium clients credit dataset.
Conversely, with our proposed methods the accuracy improves significantly to 89% on the Taiwan clients credit dataset, 84.6% on the South German clients credit dataset, and 87.1% on the Belgium clients credit dataset. The results show that the classifiers perform better on the balanced datasets than on the imbalanced ones. It is also observed that oversampling techniques perform better than undersampling techniques. Overall, the Gradient Boosted Decision Tree method performs better than the other traditional machine learning classifiers, and it gives the best results when combined with the K-means SMOTE oversampling method. Using one-way ANOVA, the null hypothesis was rejected with a p-value < 0.001, confirming that the performance improvement of the proposed model is statistically significant. The interpretable model is also deployed on the web for the convenience of different stakeholders. This model will help commercial banks, financial organizations, loan institutes, and other decision-makers to predict loan defaulters earlier.
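
Two of the steps named above, Min-Max normalization and the one-way ANOVA test, reduce to short formulas. The following is a minimal sketch of both in plain Python, not the authors' implementation. The function names, the credit-limit values, and the accuracy scores are hypothetical and chosen only for illustration; none of the numbers come from the study. The sketch computes the F statistic only; turning it into a p-value requires the F distribution, which is not included here.

```python
from statistics import mean


def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature has no spread; map it to 0.0 by convention (assumption).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df) for one-way ANOVA.

    Assumes at least two groups and some within-group variation.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical credit-limit values (illustration only).
credit_limits = [10000, 50000, 200000, 500000]
scaled = min_max_scale(credit_limits)

# Hypothetical accuracy scores for three resampling settings (not from the paper).
accuracy_by_setting = [
    [0.66, 0.67, 0.68],  # no resampling
    [0.78, 0.80, 0.79],  # undersampling
    [0.88, 0.89, 0.90],  # oversampling
]
f_stat, df_between, df_within = one_way_anova_f(accuracy_by_setting)

print(scaled)
print(f_stat, df_between, df_within)
```

A large F statistic relative to the F distribution with the returned degrees of freedom indicates that the group means differ. In practice the scaling bounds would be taken from the training split only and reused on the test split, so that no information from the test set leaks into preprocessing.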
