Default prediction in P2P lending from high-dimensional data based on machine learning

Abstract In recent years, peer-to-peer (P2P) lending, a new Internet-based unsecured credit model, has flourished and become a successful complement to the traditional credit business. However, credit risk remains inevitable. A key challenge for a P2P lending platform is building a default prediction model that accurately predicts the default probability of each loan. Because P2P lending credit data are high-dimensional and class-imbalanced, conventional statistical models and machine learning algorithms cannot predict default probability effectively and accurately. To address this issue, this paper proposes a heterogeneous ensemble default prediction model, built on decision tree models, for predicting customer default in P2P lending. Gradient boosting decision trees (GBDT), extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM) serve as the individual classifiers of the ensemble. Learning model-based feature ranking is applied to the P2P lending credit data, and the individual classifiers undergo hyperparameter optimization. Finally, comparison with benchmark models shows that the proposed model achieves desirable prediction results and thus effectively handles prediction from high-dimensional and imbalanced data.
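The ensemble idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the three classifiers' outputs are combined, so the weighted averaging (soft voting) below is an assumption, as are the function name `soft_vote`, the example probabilities, the weights, and the 0.5 decision threshold. The sketch takes default probabilities that each model has already produced and only shows the combination step.

```python
from typing import Dict, List, Sequence


def soft_vote(probas: Dict[str, Sequence[float]],
              weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-model default probabilities for each loan.

    Assumption: the abstract does not specify the fusion rule; weighted
    soft voting is used here purely for illustration.
    """
    names = list(probas)
    n_loans = len(probas[names[0]])
    if any(len(probas[name]) != n_loans for name in names):
        raise ValueError("all models must score the same set of loans")
    total_weight = sum(weights[name] for name in names)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(weights[name] * probas[name][i] for name in names) / total_weight
        for i in range(n_loans)
    ]


# Hypothetical default probabilities for three loans from each classifier.
model_probas = {
    "gbdt": [0.10, 0.70, 0.40],
    "xgboost": [0.20, 0.80, 0.30],
    "lightgbm": [0.15, 0.90, 0.50],
}
# Hypothetical weights; in practice they might be tuned on validation data.
model_weights = {"gbdt": 1.0, "xgboost": 1.0, "lightgbm": 2.0}

ensemble_proba = soft_vote(model_probas, model_weights)
# Assumed threshold of 0.5; imbalanced data may call for a different cut-off.
flags = [p >= 0.5 for p in ensemble_proba]
```

With class-imbalanced loan data, the threshold and the weights would normally be chosen on held-out data rather than fixed in advance as they are here.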
