A novel case‐control subsampling approach for rapid model exploration of large clustered binary data

In many settings, the goal of an analysis is to identify a factor, or set of factors, associated with an event or outcome. These associations are then often used for inference and prediction. In the big data era, however, the model building and exploration phases of an analysis can be time-consuming, especially when constrained by computing power (i.e., a typical corporate workstation). To speed up model development, we propose a novel subsampling scheme that enables rapid model exploration of clustered binary data using flexible yet complex model set-ups: generalized linear mixed models (GLMMs) with additive smoothing splines. By reframing the binary-response prospective cohort study as a case-control-type design, and using our knowledge of the sampling fractions, we show that one can approximate the model estimates that a full cohort analysis would produce. This idea is extended to derive cluster-specific sampling fractions and thereby incorporate cluster variation into the analysis. Importantly, we demonstrate that analyses that were previously computationally prohibitive can be conducted in a timely manner on a typical workstation. The approach is applied to the analysis of risk factors associated with adverse reactions to blood donation.
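The role of the sampling fractions can be illustrated with a minimal sketch. It is not the authors' method: it uses simulated data, a single covariate, and a fixed-effects logistic model rather than a GLMM with splines, and it assumes the standard case-control correction in which the log ratio of the case and control sampling fractions enters the linear predictor as a fixed offset. All names (`fit_logistic`, `F_CASE`, `F_CONTROL`) and parameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, offset=0.0, iters=15):
    """Newton-Raphson fit of logit P(y=1) = offset + b0 + b1 * x."""
    ybar = sum(y) / len(y)
    # Start at the intercept-only fit so the first Newton step is stable.
    b0, b1 = math.log(ybar / (1.0 - ybar)) - offset, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(offset + b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Simulated "full cohort" with a rare binary outcome (assumed values).
rng = random.Random(1)
TRUE_B0, TRUE_B1 = -4.0, 0.8
N = 50_000
x_full = [rng.gauss(0.0, 1.0) for _ in range(N)]
y_full = [1 if rng.random() < sigmoid(TRUE_B0 + TRUE_B1 * xi) else 0
          for xi in x_full]

# Case-control-type subsample: keep every case, 5% of the controls.
F_CASE, F_CONTROL = 1.0, 0.05
keep = [i for i, yi in enumerate(y_full)
        if rng.random() < (F_CASE if yi else F_CONTROL)]
x_sub = [x_full[i] for i in keep]
y_sub = [y_full[i] for i in keep]

# The known sampling fractions enter as a fixed offset on the logit scale.
offset = math.log(F_CASE / F_CONTROL)
b0_sub, b1_sub = fit_logistic(x_sub, y_sub, offset=offset)
b0_full, b1_full = fit_logistic(x_full, y_full)

print(f"rows used: subsample {len(x_sub)}, full cohort {len(x_full)}")
print(f"subsample + offset: b0 = {b0_sub:.3f}, b1 = {b1_sub:.3f}")
print(f"full cohort:        b0 = {b0_full:.3f}, b1 = {b1_full:.3f}")
```

In this toy setting the subsample fit uses a small fraction of the rows yet should recover intercept and slope estimates close to those of the full cohort fit. A cluster-specific version would presumably replace the single offset with one value per cluster, but that extension is not shown here.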
