Preprocessing and Feature Extraction Methods for Microfinance Overdue Data

With the rapid development of the microfinance industry, the number of customers has surged and the bad debt rate has risen dramatically. The growing number of overdue customers has led to a substantial increase in business volume for the collection industry. However, the industry faces two major problems: under current customer privacy protection policies, credit information is lacking, and collection is constrained in both cost and scale. This paper proposes a repayment probability forecasting system that does not rely on credit information yet can improve collection efficiency. The system preprocesses more than one hundred thousand overdue records, uses word2vec to locate keywords, and extracts features from the data according to their types. It then applies mature machine learning models, namely logistic regression (LR), gradient boosting decision trees (GBDT), XGBoost and random forest (RF), to predict customers' ability to repay. To evaluate performance, we use AUC and also design a new evaluation index adapted to the business background. Experimental results show that, when business volume surges and only around 1.5% of overdue customers repay, collecting from only the higher-scoring half of customers can increase the repayment rate by at least 1.2%, which greatly improves work efficiency and reduces the manual labor required for collection.
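The evaluation idea described above can be illustrated with a minimal sketch. It assumes each overdue customer has a model score and a binary repayment label, and it computes AUC together with the repayment rate among the higher-scoring fraction of customers. The function names `auc` and `top_fraction_repayment_rate` and the toy data are hypothetical. The abstract does not define the paper's own evaluation index, so this is only one plausible business-oriented measure, not the authors' metric.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random repaying customer (label 1)
    is scored above a random non-repaying one (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_fraction_repayment_rate(
    scores: Sequence[float], labels: Sequence[int], fraction: float = 0.5
) -> float:
    """Repayment rate among the `fraction` of customers with the highest scores.

    Hypothetical helper: an assumed stand-in for a business-oriented index,
    not the index designed in the paper.
    """
    if len(scores) == 0:
        raise ValueError("no customers to rank")
    # Rank customer indices from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(len(order) * fraction))
    return sum(labels[i] for i in order[:k]) / k


# Toy data (made up for illustration): 8 customers, 2 of whom repay.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]

base_rate = sum(toy_labels) / len(toy_labels)
top_half_rate = top_fraction_repayment_rate(toy_scores, toy_labels, 0.5)
print("AUC:", auc(toy_scores, toy_labels))
print("base repayment rate:", base_rate)
print("top-half repayment rate:", top_half_rate)
```

Comparing the top-half rate with the base rate shows how much collection effort restricted to high-scoring customers would gain over contacting everyone, which is the kind of comparison the reported result refers to.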
