Integration of unsupervised and supervised machine learning algorithms for credit risk assessment

Abstract: Credit scoring has become a critical tool for financial institutions to discriminate “bad” applicants from “good” applicants in credit risk assessment. A wide range of supervised machine learning algorithms have been applied successfully to credit scoring; however, the integration of unsupervised learning with supervised learning in this field has received little attention. In this work, we propose a strategy that combines unsupervised and supervised learning for credit risk assessment. Unlike previous work on unsupervised integration, we apply unsupervised learning techniques at two different stages: the consensus stage and the dataset clustering stage. Model performance is compared on three credit datasets across four groups: individual models, individual models + consensus model, clustering + individual models, and clustering + individual models + consensus model. The results show that integration at either the consensus stage or the dataset clustering stage is effective in improving the performance of credit scoring models. Moreover, combining the two stages achieves the best performance, which confirms the superiority of the proposed integration of unsupervised and supervised machine learning algorithms and boosts our confidence that this strategy can be extended to many other credit datasets from financial institutions.
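The two-stage idea in the abstract (cluster the dataset, fit individual models per cluster, then merge their outputs in a consensus step) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the use of scikit-learn, k-means as the clustering method at both stages, logistic regression and a decision tree as the individual models, the synthetic data, and the helper names `make_toy_credit_data`, `fit_clustered_models`, and `consensus_predict`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def make_toy_credit_data(n=600, seed=0):
    """Synthetic stand-in for a credit dataset (assumption, not real data).

    Two applicant segments are separated along feature 0, and the rule that
    decides "bad" (1) versus "good" (0) differs between the segments.
    """
    rng = np.random.default_rng(seed)
    segment = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 4))
    X[:, 0] += 6 * segment
    score = np.where(segment == 0, X[:, 1], X[:, 2]) + 0.3 * rng.normal(size=n)
    return X, (score > 0).astype(int)


def fit_clustered_models(X, y, n_clusters=2, seed=0):
    """Dataset clustering stage: k-means, then base models inside each cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    models = {}
    for c in range(n_clusters):
        mask = km.labels_ == c
        if len(np.unique(y[mask])) < 2:
            # This sketch assumes every cluster contains both classes.
            raise ValueError(f"cluster {c} contains a single class")
        models[c] = [
            LogisticRegression(max_iter=1000).fit(X[mask], y[mask]),
            DecisionTreeClassifier(max_depth=4, random_state=seed).fit(X[mask], y[mask]),
        ]
    return km, models


def consensus_predict(km, models, X, seed=0):
    """Consensus stage: group the base models' probabilities with 2-means.

    Each applicant is routed to its cluster's models, the predicted
    probabilities of "bad" are stacked into a matrix, and that matrix is
    split into two groups. The group whose centre has the higher mean
    probability is labelled "bad" (1).
    """
    seg = km.predict(X)
    n_models = len(next(iter(models.values())))
    proba = np.zeros((len(X), n_models))
    for c, fitted in models.items():
        mask = seg == c
        if mask.any():
            for j, model in enumerate(fitted):
                proba[mask, j] = model.predict_proba(X[mask])[:, 1]
    vote = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(proba)
    bad_group = int(np.argmax(vote.cluster_centers_.mean(axis=1)))
    return (vote.labels_ == bad_group).astype(int)
```

A typical use would split the data into training and test parts, call `fit_clustered_models` on the training part, and pass the held-out applicants to `consensus_predict`. The k-means consensus shown here is only one possible unsupervised way of merging model outputs; the abstract does not say which technique the authors use at either stage.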
