A Gaussian mixture model based combined resampling algorithm for classification of imbalanced credit data sets

Credit scoring is a binary classification problem. Credit data sets are also frequently imbalanced: one class contains only a small number of samples while the other contains a large number. As a result, a traditional classifier applied directly to such data tends to perform poorly. To improve the classification of credit data sets, a Gaussian mixture model based combined resampling algorithm is proposed. The approach first uses a sampling factor to determine the number of samples to retain from the majority class and the number to generate for the minority class. Gaussian mixture clustering is then used to undersample the majority class, and the synthetic minority oversampling technique (SMOTE) is used to oversample the minority class, so that the class imbalance is eliminated. We compare the proposed method with several resampling methods commonly used in the analysis of imbalanced credit data sets. The experimental results demonstrate that the proposed method consistently improves classification performance as measured by F-measure, AUC, G-mean, and other metrics. In addition, the method is robust across credit data sets.
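The resampling steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not define the sampling factor, so the rule used here (`alpha` moves both classes to a common size between the two original counts) is hypothetical, as are the function names, the proportional per-cluster quota, and the simplified diagonal-covariance EM used for the Gaussian mixture. Labels are assumed to be 1 for the minority class and 0 for the majority class.

```python
import numpy as np


def gmm_labels(X, k, n_iter=50, seed=0):
    """Cluster X with a diagonal-covariance Gaussian mixture fitted by EM (simplified)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    means = X[rng.choice(n, size=k, replace=False)]
    var = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log(weight * density) for every (sample, component) pair.
        diff = X[:, None, :] - means[None, :, :]
        log_p = -0.5 * (diff ** 2 / var + np.log(2 * np.pi * var)).sum(axis=2)
        log_p += np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        var = np.maximum((resp.T @ (X ** 2)) / nk[:, None] - means ** 2, 0.0) + 1e-6
    return resp.argmax(axis=1)


def undersample_majority(X_maj, n_keep, k=3, seed=0):
    """Keep n_keep majority rows, drawn from each cluster in proportion to its size."""
    rng = np.random.default_rng(seed)
    labels = gmm_labels(X_maj, k, seed=seed)
    n = len(X_maj)
    sizes = np.bincount(labels, minlength=k)
    quotas = (n_keep * sizes) // n                      # integer share per cluster
    leftover = n_keep - quotas.sum()
    order = np.argsort(-((n_keep * sizes) % n))         # largest remainders first
    quotas[order[:leftover]] += 1
    keep = [rng.permutation(np.flatnonzero(labels == c))[:quotas[c]] for c in range(k)]
    return X_maj[np.concatenate(keep)]


def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating towards nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)                                   # needs at least two minority rows
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                      # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


def combined_resample(X, y, alpha=0.5, seed=0):
    """Undersample class 0 and oversample class 1 to a common target size."""
    X_maj, X_min = X[y == 0], X[y == 1]
    # Assumed sampling-factor rule: alpha=0 keeps the minority size, alpha=1 the majority size.
    target = len(X_min) + int(round(alpha * (len(X_maj) - len(X_min))))
    X_maj_new = undersample_majority(X_maj, target, seed=seed)
    X_min_new = np.vstack([X_min, smote(X_min, target - len(X_min), seed=seed)])
    X_new = np.vstack([X_maj_new, X_min_new])
    y_new = np.concatenate([np.zeros(len(X_maj_new), dtype=int),
                            np.ones(len(X_min_new), dtype=int)])
    return X_new, y_new


# Toy data: 200 majority rows and 20 minority rows in two dimensions.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), rng.normal(2.5, 0.5, size=(20, 2))])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(20, dtype=int)])
X_res, y_res = combined_resample(X, y, alpha=0.5)
print(np.bincount(y_res))
```

With `alpha=0.5` the assumed rule sets the target to 20 + 0.5 × (200 − 20) = 110 samples per class, so the script prints `[110 110]`. In practice one would more likely rely on a maintained Gaussian mixture and SMOTE implementation (for example those in scikit-learn and imbalanced-learn) than on this hand-written version, which exists only to make the three steps explicit.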
