Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem

Most real-world data analyzed with nonlinear classification techniques are imbalanced in the proportion of examples available for each class. Such imbalanced class distributions can lead algorithms to learn overly complex models that overfit the data and have little relevance. This study analyzes several classification algorithms used to predict the creditworthiness of a bank's customers from checking account information. A series of experiments was conducted to test the different techniques. The objective is to determine a range of credit scores that a manager could apply in risk management. The results show that classifying with equal quantities of examples in each class allows the implicit knowledge to be discovered successfully. A data-cleaning strategy for handling such real cases with imbalanced data distributions is then proposed.
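
One way to read "classification with equal quantities" is to train on the same number of examples from each class. The abstract does not say how the classes were equalized, so the following is only a minimal sketch that assumes random undersampling of the larger class. The function name `balance_by_undersampling`, the `label_of` parameter, and the toy customer records are all hypothetical and do not come from the study.

```python
import random
from collections import defaultdict


def balance_by_undersampling(records, label_of, seed=0):
    """Return a shuffled subset holding the same number of records per class.

    Assumption: classes are equalized by randomly discarding records from
    the larger classes; the study itself may have used another procedure.
    """
    rng = random.Random(seed)

    # Group the records by their class label.
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)

    # The smallest class sets the per-class quota.
    quota = min(len(group) for group in by_class.values())

    # Draw `quota` records without replacement from every class.
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, quota))

    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: (checking_balance, label) pairs, 90 "good" vs 10 "bad".
customers = [(1000 + i, "good") for i in range(90)]
customers += [(-50 * i, "bad") for i in range(10)]

balanced = balance_by_undersampling(customers, label_of=lambda r: r[1])
```

Undersampling discards majority-class records, which keeps the sketch simple but loses information. Oversampling the minority class is a common alternative when data are scarce.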
