On the impact of disproportional samples in credit scoring models: An application to Brazilian bank data

Statistical methods have been widely employed to assess the capabilities of credit scoring classification models in order to reduce the risk of wrong decisions when granting credit facilities to clients. The predictive quality of a classification model can be evaluated with measures such as sensitivity, specificity, predictive values, accuracy, correlation coefficients, and information-theoretic measures such as relative entropy and mutual information. In this paper we analyze the performance of a naive logistic regression model (Hosmer & Lemeshow, 1989) and a logistic regression with state-dependent sample selection model (Cramer, 2004) applied to simulated data. As a case study, the methodology is also illustrated on a data set extracted from a Brazilian bank portfolio. Our simulation results so far reveal no statistically significant difference in predictive capacity between the naive logistic regression models and the logistic regression with state-dependent sample selection models. However, there is a strong difference between the distributions of the default probabilities estimated by these two modeling techniques, with the naive logistic regression models always underestimating such probabilities, particularly in the presence of balanced samples.
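As an illustration of the evaluation measures listed above, the following minimal sketch computes them from a 2x2 confusion matrix. It is not the authors' code: the function name `classification_measures`, the argument names, and the example counts are hypothetical, the correlation coefficient is taken to be the Matthews correlation coefficient, and mutual information is computed in bits between observed and predicted classes.

```python
import math


def classification_measures(tp, fp, fn, tn):
    """Compute common predictive-quality measures from confusion-matrix counts.

    tp, fp, fn, tn are counts of true positives, false positives,
    false negatives and true negatives (hypothetical argument names).
    """
    n = tp + fp + fn + tn

    def ratio(num, den):
        # Return NaN instead of raising when a margin of the table is empty.
        return num / den if den > 0 else float("nan")

    # Matthews correlation coefficient; set to 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0

    # Mutual information (in bits) between observed and predicted classes.
    joint = {(1, 1): tp / n, (1, 0): fn / n, (0, 1): fp / n, (0, 0): tn / n}
    p_obs = {1: (tp + fn) / n, 0: (fp + tn) / n}
    p_pred = {1: (tp + fp) / n, 0: (fn + tn) / n}
    mutual_info = sum(
        p * math.log2(p / (p_obs[o] * p_pred[d]))
        for (o, d), p in joint.items()
        if p > 0
    )

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, n),
        "mcc": mcc,
        "mutual_information": mutual_info,
    }


# Example with made-up counts, for illustration only.
measures = classification_measures(tp=40, fp=10, fn=20, tn=30)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Working from the four counts keeps the sketch independent of any particular model, so the same function could be applied to predictions from either of the two logistic regression variants once a classification cutoff has been chosen.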

[1] David W. Hosmer, et al. Applied Logistic Regression, 1991.

[2] L. Breiman. Arcing classifier (with discussion and a rejoinder by the author), 1998.

[3] R. Fisher. The Use of Multiple Measurements in Taxonomic Problems, 1936.

[4] Johan A. K. Suykens, et al. Benchmarking state-of-the-art classification algorithms for credit scoring, 2003, J. Oper. Res. Soc.

[5] Vijay S. Desai, et al. A comparison of neural networks and linear scoring models in the credit union environment, 1996.

[6] Dennis L. Hoffman, et al. An econometric analysis of the bank credit scoring problem, 1989.

[7] Graham Dunn, et al. Clinical Biostatistics: An Introduction to Evidence-Based Medicine, 1995.

[8] B. Rost, et al. Redefining the goals of protein secondary structure prediction, 1994, Journal of molecular biology.

[9] A. Steenackers, et al. A credit scoring model for personal loans, 1989.

[10] David J. Hand, et al. Classifier Technology and the Illusion of Progress, 2006, math/0606441.

[11] Chih-Chou Chiu, et al. Credit scoring using the hybrid neural discriminant technique, 2002, Expert Syst. Appl.

[12] W. Greene. Sample selection in credit-scoring models, 1998.

[13] Solomon Kullback, et al. Information Theory and Statistics, 1960.

[14] J. Baron. Thinking and Deciding, 2023.

[15] Pierre Baldi, et al. Assessing the accuracy of prediction algorithms for classification: an overview, 2000, Bioinform.

[16] Zhi-Xin Wang. Assessing the accuracy of protein secondary structure, 1994, Nature Structural Biology.

[17] Tian-Shyug Lee, et al. A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines, 2005, Expert Syst. Appl.

[18] E. Altman, et al. Managing Credit Risk: The Next Great Financial Challenge, 1998.

[19] So Young Sohn, et al. Cluster-based dynamic scoring model, 2007, Expert Syst. Appl.

[20] J. Cramer, et al. Scoring Bank Loans that may go wrong – A Case Study, 2000.

[21] David J. Hand, et al. Statistical Classification Methods in Consumer Credit Scoring: a Review, 1997.

[22] B. Rost, et al. Prediction of protein secondary structure at better than 70% accuracy, 1993, Journal of molecular biology.

[23] Jonathan N. Crook, et al. Credit Scoring and Its Applications, 2002, SIAM monographs on mathematical modeling and computation.

[24] L. Thomas. A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers, 2000.

[25] R. A. Leibler, et al. On Information and Sufficiency, 1951.

[26] Jonathan Crook, et al. Scoring by usage, 2001, J. Oper. Res. Soc.

[27] N. Šarlija, et al. Multinomial model in consumer credit scoring, 2005.

[28] B. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme, 1975, Biochimica et biophysica acta.

[29] R. Guigó, et al. Evaluation of gene structure prediction programs, 1996, Genomics.

[30] Gustavo A. Stolovitzky, et al. Bioinformatics: The Machine Learning Approach, 2002.