Review of methods for handling confounding by cluster and informative cluster size in clustered data

Clustered data are common in medical research. Typically, one is interested in a regression model for the association between an outcome and covariates. Two complications that can arise when analysing clustered data are informative cluster size (ICS) and confounding by cluster (CBC). ICS and CBC mean that the outcome of a member given its covariates is associated with, respectively, the number of members in the cluster and the covariate values of other members in the cluster. Standard generalised linear mixed models for cluster-specific inference and standard generalised estimating equations for population-average inference assume, in general, the absence of ICS and CBC. Modifications of these approaches have been proposed to account for CBC or ICS. This article is a review of these methods. We express their assumptions in a common format, thus providing greater clarity about the assumptions that methods proposed for handling CBC make about ICS and vice versa, and about when different methods can be used in practice. We report relative efficiencies of methods where available, describe how methods are related, identify a previously unreported equivalence between two key methods, and propose some simple additional methods. Unnecessarily using a method that allows for ICS/CBC has an efficiency cost when ICS and CBC are absent. We review tools for identifying ICS/CBC. A strategy for analysis when CBC and ICS are suspected is demonstrated by examining the association between socio-economic deprivation and preterm neonatal death in Scotland.
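As an illustration of why ICS matters, the following is a minimal simulation sketch, not the analysis used in the article. It assumes a hypothetical data-generating model in which a cluster-level random effect raises both the outcome and the expected cluster size. It then compares the unweighted mean over all members with a mean weighted by inverse cluster size, one of the simple modifications proposed in the ICS literature. The function name, parameter values, and distributions are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_ics(n_clusters=20000, seed=0):
    """Simulate clustered outcomes with informative cluster size.

    Hypothetical model (an assumption for illustration only):
      b_i  ~ N(0, 1)                   cluster-level random effect
      N_i  = 1 + Poisson(exp(b_i))     cluster size depends on b_i
      Y_ij = b_i + e_ij, e_ij ~ N(0, 1)
    """
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, 1.0, size=n_clusters)
    sizes = 1 + rng.poisson(np.exp(b))

    # Expand cluster-level quantities to one row per member.
    cluster_id = np.repeat(np.arange(n_clusters), sizes)
    y = b[cluster_id] + rng.normal(0.0, 1.0, size=cluster_id.size)

    # Unweighted mean: every member counts equally, so large clusters
    # (which tend to have large b_i here) dominate the estimate.
    member_mean = y.mean()

    # Inverse cluster-size weighting: every cluster counts equally,
    # which targets the mean outcome of a member of a typical cluster.
    weights = 1.0 / sizes[cluster_id]
    cluster_weighted_mean = np.sum(weights * y) / np.sum(weights)

    return member_mean, cluster_weighted_mean


if __name__ == "__main__":
    member_mean, cluster_weighted_mean = simulate_ics()
    print(f"mean over all members:       {member_mean:.3f}")
    print(f"inverse-size weighted mean:  {cluster_weighted_mean:.3f}")
```

Under this assumed model the two estimates target different quantities: the weighted mean is close to zero, the mean for a member of a typical cluster, while the unweighted mean is pulled upward by the larger clusters. Which target is appropriate depends on the scientific question.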
