Can the buck always be passed to the highest level of clustering?

Background: Clustering commonly affects the uncertainty of parameter estimates in epidemiological studies. Cluster-robust variance estimates (CRVE) are used to construct confidence intervals that account for single-level clustering, and are easily implemented in standard software. When data are clustered at more than one level (e.g. village and household), the level for the CRVE must be chosen. CRVE are consistent when used at the higher level of clustering (village), but since there are fewer clusters at the higher level, and consistency is an asymptotic property, there may be circumstances under which coverage is better from lower-level than from higher-level CRVE. Here we assess the relative importance of adjusting for clustering at the higher and lower level in a logistic regression model.

Methods: We performed a simulation study in which the coverage of 95% confidence intervals was compared between adjustments at the higher and lower levels.

Results: Confidence intervals adjusted for the higher level of clustering had coverage close to 95%, even when there were few clusters, provided that the intra-cluster correlation of the predictor was less than 0.5 for models with a single predictor and less than 0.2 for models with multiple predictors.

Conclusions: When there are multiple levels of clustering, it is generally preferable to use confidence intervals that account for the highest level of clustering. This only fails if there are few clusters at this level and the intra-cluster correlation of the predictor is high.
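
To make the choice of clustering level concrete, the following is a minimal sketch of a cluster-robust (sandwich) variance estimate for a logistic regression, computed once with village as the cluster and once with household. It is not the authors' simulation code: the data-generating values (20 villages, 5 households per village, 4 individuals per household, the random-effect scales and coefficients) and the helper names `fit_logistic` and `cluster_robust_cov` are assumptions chosen for illustration, and no small-sample correction is applied.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit an ordinary logistic regression by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)              # score vector
        hess = X.T @ (X * w[:, None])     # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def cluster_robust_cov(X, y, beta, cluster_ids):
    """Sandwich covariance with scores summed within each cluster.

    Uncorrected estimator (no small-sample adjustment).
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    bread = np.linalg.inv(X.T @ (X * w[:, None]))
    scores = X * (y - p)[:, None]         # per-observation score contributions
    _, idx = np.unique(cluster_ids, return_inverse=True)
    idx = np.ravel(idx)
    summed = np.zeros((idx.max() + 1, X.shape[1]))
    np.add.at(summed, idx, scores)        # one row of summed scores per cluster
    meat = summed.T @ summed
    return bread @ meat @ bread

# Hypothetical two-level data: individuals within households within villages.
rng = np.random.default_rng(0)
n_villages, hh_per_village, n_per_hh = 20, 5, 4
n_households = n_villages * hh_per_village
n = n_households * n_per_hh
village = np.repeat(np.arange(n_villages), hh_per_village * n_per_hh)
household = np.repeat(np.arange(n_households), n_per_hh)

# Predictor with a shared village-level component, so it is itself clustered.
x = rng.normal(size=n) + rng.normal(size=n_villages)[village]
# Outcome with random effects at both levels (illustrative scales).
eta = (-0.5 + 0.5 * x
       + 0.5 * rng.normal(size=n_villages)[village]
       + 0.5 * rng.normal(size=n_households)[household])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

X = np.column_stack([np.ones(n), x])
beta = fit_logistic(X, y)

cov_village = cluster_robust_cov(X, y, beta, village)
cov_household = cluster_robust_cov(X, y, beta, household)
se_village = np.sqrt(np.diag(cov_village))
se_household = np.sqrt(np.diag(cov_household))

print("coefficients:      ", beta)
print("SE, village-level: ", se_village)
print("SE, household-level:", se_household)
```

Repeating this over many simulated data sets and recording how often the interval `beta[1] ± 1.96 * se` covers the value used to generate the data would give a coverage comparison between the two levels, in the spirit of the simulation study described above.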
