Effects of missing data in credit risk scoring. A comparative analysis of methods to achieve robustness in the absence of sufficient data

Abstract The 2004 Basel II Accord highlighted the benefits of managing credit risk through internal models that use internal data to estimate the risk components: probability of default (PD), loss given default, exposure at default and maturity. Internal data are the primary data source for PD estimates; banks are permitted to use statistical default prediction models to estimate borrowers' PD, subject to requirements concerning the accuracy, completeness and appropriateness of the data. In practice, however, internal records are usually incomplete or do not contain enough history to estimate the PD. Missing data are especially critical for low-default portfolios, which are characterised by few default records, making it difficult to design statistically significant prediction models. Several methods can be used to deal with missing data, such as list-wise deletion, application-specific list-wise deletion, substitution techniques or imputation models (simple and multiple variants). List-wise deletion is an easy-to-use method widely applied by social scientists, but it discards substantial data and reduces the diversity of information, which biases the model's parameters, results and inferences. The choice of the best method to solve the missing data problem largely depends on the nature of the missing values, that is, whether they are missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR), but the lack of empirical analysis of their effect on credit risk limits the validity of the resulting models. In this paper, we analyse the nature and effects of missing data in credit risk modelling (MCAR, MAR and MNAR processes) and take into account a current, scarce data set on consumer borrowers, which includes different percentages and distributions of missing data.
The findings are used to analyse the performance of several methods for dealing with missing data, such as list-wise deletion, simple imputation methods, maximum likelihood estimation (MLE) models and advanced multiple imputation (MI) alternatives based on Markov chain Monte Carlo and re-sampling methods. Results are evaluated and compared across models in terms of robustness, accuracy and complexity. In particular, MI models are found to provide very valuable solutions for missing data in credit risk.
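
To make the distinction between the three missingness processes concrete, the following is a minimal sketch on synthetic data, not the paper's data set or its models. The variable names (`income`, `debt`), the coefficients, the missingness rates and the helper functions are all assumptions chosen for illustration. It compares list-wise deletion with a single deterministic regression imputation, which is a much simpler technique than the multiple imputation methods evaluated in the paper.

```python
import numpy as np


def simulate_missingness(n=20000, seed=0):
    """Generate hypothetical borrower data and three missingness masks for `debt`."""
    rng = np.random.default_rng(seed)
    # Assumed toy relationship: standardised income drives a debt indicator.
    income = rng.normal(0.0, 1.0, n)
    debt = 0.8 * income + rng.normal(0.0, 0.6, n)

    masks = {
        # MCAR: every value has the same 30% chance of being missing.
        "MCAR": rng.random(n) < 0.3,
        # MAR: missingness depends only on the fully observed income.
        "MAR": rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * income)),
        # MNAR: missingness depends on the unobserved debt value itself.
        "MNAR": rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * debt)),
    }
    return income, debt, masks


def complete_case_mean(debt, mask):
    """List-wise deletion: average only the records where debt is observed."""
    return debt[~mask].mean()


def regression_imputed_mean(income, debt, mask):
    """Single regression imputation: fill missing debt from a line fitted on observed rows."""
    observed = ~mask
    slope, intercept = np.polyfit(income[observed], debt[observed], 1)
    filled = debt.copy()
    filled[mask] = intercept + slope * income[mask]
    return filled.mean()


if __name__ == "__main__":
    income, debt, masks = simulate_missingness()
    print(f"full-data mean of debt: {debt.mean():.3f}")
    for name, mask in masks.items():
        cc = complete_case_mean(debt, mask)
        ri = regression_imputed_mean(income, debt, mask)
        print(f"{name}: list-wise deletion {cc:.3f}, regression imputation {ri:.3f}")
```

In this toy setting, list-wise deletion should stay close to the full-data mean only under MCAR, while imputing from the observed covariate should recover it under MAR as well. Under MNAR both estimates remain biased, which is why the nature of the missing values matters when choosing a method.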
