Default prediction model: the significant role of data engineering in the quality of outcomes

For banks and other financial institutions, predictive models for core financial activities are crucial, especially for activities that play a major role in risk management. Predicting loan default is one of the most critical of these tasks, because large revenue losses can be prevented by predicting not only whether a customer is able to pay back a loan, but also whether the customer can do so on time. Customer loan default prediction is the task of proactively identifying the customers who are most likely to stop repaying their loans. This is usually done by dynamically analyzing customers’ relevant information and behavior, so that the bank or financial institution can estimate each borrower’s risk. Many machine learning classification models and algorithms have been used to predict customers’ ability to pay back loans. In this paper, three classification methods (Naïve Bayes, Decision Tree, and Random Forest) are used for prediction; a comprehensive set of preprocessing techniques is applied to the dataset to improve data quality by fixing major data issues such as missing values and class imbalance; and three feature extraction algorithms are used to enhance accuracy and performance. The results of the competing models varied after data preprocessing and feature selection were applied. The results were compared using the F1 measure. The best model achieved an improvement of about 40%, whereas the worst-performing model improved by only 3%. This demonstrates the importance of data engineering (e.g., data preprocessing and feature selection) in machine learning exercises.
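The following is a minimal illustrative sketch of why the F1 measure, rather than plain accuracy, is a natural choice for comparing models on imbalanced default data. It is not taken from the paper: the toy labels, the `always_repay` baseline, and the helper names `f1_score` and `accuracy` are all assumptions made for illustration, and the positive class (label 1) is assumed to mean "default".

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined,
        # so F1 is reported as 0.0 by convention here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical toy data: 10 borrowers, of whom 2 default (label 1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Baseline that never predicts a default.
always_repay = [0] * 10

# Hypothetical model: catches one defaulter, misses one, raises one false alarm.
model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

for name, y_pred in [("always_repay", always_repay), ("model", model)]:
    print(name, "accuracy:", accuracy(y_true, y_pred), "F1:", f1_score(y_true, y_pred))
```

On this toy data both predictors have the same accuracy, yet only the second identifies any defaulter, and F1 is the metric that separates them. This is one plausible reason to report F1 when the default class is rare.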
