A graph-based semi-supervised reject inference framework considering imbalanced data distribution for consumer credit scoring

Abstract: Credit scoring has attracted increasing attention in the Chinese consumer finance industry. Traditional approaches are prone to sample selection bias because they are trained on accepted applicants only, whereas the applicant population also includes rejected applicants. Reject inference is a technique that infers good/bad labels for rejected applicants and can thereby reduce this bias. However, previously proposed reject inference methods usually ignore the imbalanced distribution of the accepted data: in most practical consumer loan applications, good applicants far outnumber bad ones. Both the neglect of rejected data and the imbalance in accepted data weaken the performance of current credit scoring models. In this paper, we propose a novel reject inference framework that accounts for the imbalanced data distribution in consumer credit scoring. First, we solve the reject inference problem with label spreading, an advanced graph-based semi-supervised learning algorithm. Second, to handle the imbalance between good and bad samples among accepted applicants, we apply a modified Synthetic Minority Over-sampling Technique (SMOTE) before reject inference. Third, we study six binary classifiers within the proposed framework for credit scoring modelling. Finally, we present the results of four exact experiments as well as online A/B tests for performance evaluation, using data provided by a leading Chinese fintech company. Empirical results indicate that the proposed framework outperforms traditional scoring models across different evaluation metrics, advancing credit scoring research and improving fintech practice.
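
The order of steps described in the abstract (oversample the accepted bad applicants, then spread labels to the rejected applicants) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: plain SMOTE interpolation stands in for the modified SMOTE, the label spreading update follows the standard iterative form with an RBF affinity, and the toy data, the helper names `smote_oversample` and `label_spreading`, and the values of `k`, `alpha`, and `gamma` are all assumptions chosen for illustration.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=3, seed=0):
    """Plain SMOTE (assumption: the paper's modified variant is not reproduced).

    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


def label_spreading(X, y, alpha=0.2, gamma=1.0, n_iter=50):
    """Iterative label spreading; y holds 0/1 for labelled rows, -1 for unlabelled.

    Returns an (n, 2) matrix of soft label scores.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-gamma * sq)                   # RBF affinity matrix
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    S = W / np.sqrt(np.outer(deg, deg))       # symmetric normalisation
    Y = np.zeros((len(X), 2))
    idx = np.flatnonzero(y >= 0)
    Y[idx, y[idx]] = 1.0                      # one-hot rows for labelled samples
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F


# Toy data (hypothetical): label 0 = good, label 1 = bad.
rng = np.random.default_rng(42)
good = rng.normal(loc=2.0, scale=0.5, size=(40, 2))
bad = rng.normal(loc=-2.0, scale=0.5, size=(6, 2))     # imbalanced accepted set
rejected = np.vstack([
    rng.normal(loc=2.0, scale=0.5, size=(8, 2)),
    rng.normal(loc=-2.0, scale=0.5, size=(12, 2)),
])
true_rej = np.array([0] * 8 + [1] * 12)                # unknown in practice

# Step 1: balance the accepted applicants before reject inference.
synthetic_bad = smote_oversample(bad, n_new=len(good) - len(bad))
X_bal = np.vstack([good, bad, synthetic_bad])
y_bal = np.array([0] * len(good) + [1] * (len(bad) + len(synthetic_bad)))

# Step 2: infer labels for rejected applicants via label spreading.
X_all = np.vstack([X_bal, rejected])
y_all = np.concatenate([y_bal, -np.ones(len(rejected), dtype=int)])
scores = label_spreading(X_all, y_all)
inferred = scores[len(X_bal):].argmax(axis=1)

print("inferred labels for rejected applicants:", inferred)
```

The balanced accepted samples together with the rejected samples and their inferred labels would then form the training set for whichever binary classifier is being evaluated; that final step is omitted here to keep the sketch dependency-free.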
