Variable Selection in Data Mining

We predict the onset of personal bankruptcy using least squares regression. Although bankruptcy is well publicized, only 2,244 bankruptcies occur in our dataset of 2.9 million months of credit-card activity. We use stepwise selection to find predictors of these among payment history, debt load, demographics, and their interactions. This combination of rare responses and more than 67,000 possible predictors poses a challenging modeling question: how does one separate coincidental from useful predictors? We show that three modifications turn stepwise regression into an effective methodology for predicting bankruptcy. Our version of stepwise regression (1) organizes calculations to accommodate interactions, (2) exploits modern decision-theoretic criteria to choose predictors, and (3) conservatively estimates p-values to handle sparse data and a binary response. Omitting any one of these leads to poor performance. A final step in our procedure calibrates the regression predictions. With these modifications, stepwise regression predicts bankruptcy as well as, if not better than, recently developed data-mining tools. When sorted, the largest 14,000 resulting predictions hold 1,000 of the 1,800 bankruptcies hidden in a validation sample of 2.3 million observations. If the cost of missing a bankruptcy is 200 times that of a false positive, our predictions incur less than two-thirds of the classification-error costs produced by the tree-based classifier C4.5.
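The selection idea behind the abstract can be illustrated with a minimal sketch: greedy forward stepwise regression that admits a predictor only when its Bonferroni-adjusted p-value clears a hard threshold. This is our simplified stand-in, not the paper's exact procedure (the paper uses a decision-theoretic criterion and more careful p-value estimates for sparse binary responses); the function name, the normal approximation for p-values, and the synthetic demo data are all assumptions for illustration.

```python
import numpy as np
from math import erfc, sqrt

def forward_stepwise(X, y, alpha=0.05, max_terms=10):
    """Greedy forward stepwise selection with a Bonferroni-style hard
    threshold (a sketch, not the paper's exact criterion).

    At each step, regress the current residuals on each unused centered
    column, keep the candidate with the largest |t|-statistic, and add it
    only if its two-sided p-value survives a Bonferroni correction over
    the candidates tested at that step.
    """
    n, p = X.shape
    selected = []
    resid = y - y.mean()
    for _ in range(max_terms):
        best_j, best_t = None, 0.0
        for j in range(p):
            if j in selected:
                continue
            x = X[:, j] - X[:, j].mean()
            denom = x @ x
            if denom < 1e-12:
                continue
            b = (x @ resid) / denom          # slope on current residuals
            r = resid - b * x                # residuals after this candidate
            dof = n - len(selected) - 2      # intercept + selected + candidate
            s2 = (r @ r) / dof
            t = abs(b) / sqrt(s2 / denom) if s2 > 0 else np.inf
            if t > best_t:
                best_j, best_t = j, t
        if best_j is None:
            break
        # Two-sided normal-approximation p-value, Bonferroni-adjusted for
        # the number of candidates examined in this step.
        pval = min(1.0, (p - len(selected)) * erfc(best_t / sqrt(2)))
        if pval >= alpha:
            break
        selected.append(best_j)
        # Refit on all selected columns (with intercept) and update residuals.
        Xs = np.column_stack([np.ones(n), X[:, selected]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
    return selected

# Demo on synthetic data: columns 0 and 3 carry signal, the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))
y = 5 * X[:, 0] + 4 * X[:, 3] + rng.normal(size=400)
sel = forward_stepwise(X, y)
```

The hard threshold is what guards against the abstract's "coincidental" predictors: with many noise columns, the largest spurious |t| grows with the number of candidates, and the Bonferroni adjustment inflates its p-value accordingly.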

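The 200:1 cost asymmetry at the end of the abstract implies a very low classification cutoff for calibrated probabilities: at the cost-minimizing boundary, the expected cost of flagging equals that of not flagging, giving a cutoff of c_fp/(c_fp + c_fn). A back-of-envelope sketch (the function names are ours, not the paper's):

```python
def bayes_cutoff(c_fp, c_fn):
    """Cost-minimizing probability cutoff: flag an account as a likely
    bankruptcy when its calibrated probability exceeds c_fp/(c_fp + c_fn).
    Follows from equating expected costs at the decision boundary:
    p * c_fn = (1 - p) * c_fp.
    """
    return c_fp / (c_fp + c_fn)

def total_cost(y_true, y_flag, c_fp=1.0, c_fn=200.0):
    """Sum misclassification costs: c_fn per missed bankruptcy,
    c_fp per false alarm, zero for correct decisions."""
    return sum(c_fn if y and not f else (c_fp if f and not y else 0.0)
               for y, f in zip(y_true, y_flag))

# With the abstract's 200:1 ratio, flag anyone above roughly 0.5% risk.
cutoff = bayes_cutoff(1.0, 200.0)
```

Such a low cutoff is why calibration matters in the paper's final step: small errors in predicted probabilities near 0.005 swing many classification decisions.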