Selection Stability in High Dimensional Statistical Modelling: Defining a Threshold for Robust Model Inference

Epidemiological research commonly involves identifying causal factors within high-dimensional (wide) data, where predictor variables outnumber observations. In this setting, conventional stepwise selection procedures perform poorly. Selection stability aids robust variable selection: a model is refitted to repeated resamples of the data, and the proportion of resamples in which each covariate is selected is calculated. A key problem when applying selection stability is determining the threshold of stability above which a covariate is deemed ‘important’. In this research we describe and illustrate a two-step process for implementing a stability threshold for covariate selection. First, covariate stability distributions are established with a permuted model (the outcome is randomly reordered to sever its relationship with the predictors) and summarised using a cumulative distribution function. Second, covariate stability is estimated using the true model outcome, and covariates with a stability above a threshold defined from the permuted model are selected in a final model. The proposed method performed well across 22 varied, simulated datasets with known outcomes; selection error rates were consistently lower than those of conventional implementations of equivalent models. This method of covariate selection appears to offer substantial advantages over current methods in accurately identifying the correct covariates within a large, complex parameter space.
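The two-step process can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the selector (keeping the k covariates most correlated with the outcome) stands in for whatever model is actually refitted, resampling is done by bootstrap, a single permutation of the outcome is used, and the threshold is taken as an upper quantile of the permuted stabilities. The function names and the parameters k, n_boot and quantile are hypothetical.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return abs(sxy) / (sxx * syy) ** 0.5


def select_top_k(X, y, k):
    """Stand-in selector (assumption): keep the k columns most correlated with y."""
    p = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(p)]
    return set(sorted(range(p), key=lambda j: scores[j], reverse=True)[:k])


def stability(X, y, k, n_boot, rng):
    """Proportion of bootstrap resamples in which each covariate is selected."""
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        for j in select_top_k(Xb, yb, k):
            counts[j] += 1
    return [c / n_boot for c in counts]


def stability_threshold(null_stabilities, quantile):
    """Upper quantile of the empirical distribution of permuted-model stabilities."""
    ordered = sorted(null_stabilities)
    pos = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[pos]


def select_stable(X, y, k=5, n_boot=50, quantile=0.95, seed=0):
    """Step 1: threshold from a permuted outcome. Step 2: select on the true outcome."""
    rng = random.Random(seed)
    y_perm = list(y)
    rng.shuffle(y_perm)  # sever the relationship between outcome and predictors
    null = stability(X, y_perm, k, n_boot, rng)
    threshold = stability_threshold(null, quantile)
    observed = stability(X, y, k, n_boot, rng)
    selected = [j for j, s in enumerate(observed) if s > threshold]
    return selected, threshold, observed
```

Calling select_stable(X, y) with X as a list of rows and y as a list of outcomes returns the indices of the selected covariates, the threshold and the per-covariate stabilities. Because the threshold is derived from the permuted outcome, it adapts to the dimensions of the data and to the selector rather than being fixed in advance.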