Object selection in credit scoring using the covariance matrix of parameter estimates

We address the problem of outlier detection for more reliable credit scoring. Scoring models estimate the probability of loan default from a customer's application. To obtain unbiased estimates of the model parameters, one must select a set of informative objects (customers). We propose an object selection algorithm based on analysis of the covariance matrix of the estimated model parameters. To detect outliers we introduce a new quality function, the specificity measure. For the common practical case of an ill-conditioned covariance matrix we suggest an empirical approximation of the specificity. We illustrate the algorithm on eight benchmark datasets from the UCI machine learning repository and on several artificial datasets. Computational experiments show a statistically significant improvement in classification quality on all considered datasets. The method is compared with four other widely used outlier detection methods: deviance residuals, Pearson residuals, Bayesian residuals, and gamma plots. The suggested method generally performs better for both clustered and non-clustered outliers, and shows acceptable outlier discrimination on datasets containing up to 30–40% outliers.
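The abstract does not give the exact form of the specificity measure, but the core idea — scoring each object by its influence on the covariance matrix of the logistic-regression parameter estimates — can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the leave-one-out log-determinant criterion, the ridge term `lam`, and all function names are assumptions introduced here for clarity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lam=1e-2, n_iter=50):
    """Fit a (lightly ridge-regularized) logistic regression by
    Newton-Raphson; return the estimates and their asymptotic covariance,
    the inverse of the (regularized) Fisher information matrix."""
    w = np.zeros(X.shape[1])
    H = np.eye(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        W = p * (1.0 - p)                       # per-object IRLS weights
        H = X.T @ (X * W[:, None]) + lam * np.eye(X.shape[1])
        w += np.linalg.solve(H, X.T @ (y - p) - lam * w)
    return w, np.linalg.inv(H)

def influence_scores(X, y):
    """Hypothetical influence criterion: change in the log-determinant of
    the parameter covariance matrix when one object is left out.  Objects
    whose removal sharply shrinks the covariance are outlier candidates."""
    _, cov_full = fit_logistic(X, y)
    base = np.linalg.slogdet(cov_full)[1]
    scores = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i           # leave object i out
        _, cov_i = fit_logistic(X[mask], y[mask])
        scores[i] = np.linalg.slogdet(cov_i)[1] - base
    return scores
```

Objects with the most extreme scores would then be removed before refitting the scoring model; the small ridge term plays the role of the regularization the abstract invokes for the ill-conditioned covariance case.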