Improving Credit Risk Prediction in Online Peer-to-Peer (P2P) Lending Using Imbalanced Learning Techniques

Peer-to-peer (P2P) lending is a global trend in financial markets that allows individuals to obtain and grant loans without a financial institution acting as a strong intermediary. Like many real-world applications, P2P lending is class-imbalanced: creditworthy loan requests far outnumber non-creditworthy ones. In this work, we wrangle a real-world P2P lending data set from Lending Club, containing a large amount of data gathered from 2007 to 2016. We analyze how supervised classification models and techniques for handling class imbalance affect creditworthiness prediction rates. Ensemble, cost-sensitive, and sampling methods are combined and evaluated with logistic regression, decision tree, and Bayesian learning schemes. Results show that, on average, sampling techniques outperform ensemble and cost-sensitive approaches.
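To make the idea of a sampling method concrete, the following is a minimal sketch of random oversampling, one of the simplest members of that family. The abstract does not state which sampling techniques were evaluated, so this is an illustration only and not the authors' procedure; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows that belong to the current class in the original data.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement; the majority class draws zero extras.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (assumption): 8 creditworthy requests (label 0)
# and 2 non-creditworthy requests (label 1).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
```

In practice, resampling of this kind would be applied to the training split only, so that duplicated minority rows do not leak into the evaluation data.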
