Modeling sparsely clustered data: design-based, model-based, and single-level methods.

Recent studies have investigated the small-sample properties of models for clustered data, such as multilevel models and generalized estimating equations. These studies have focused on parameter bias when the number of clusters is small, but very few have addressed the methods' properties with sparse data, that is, a small number of observations within each cluster. In particular, studies have yet to address the sparse-data properties of generalized estimating equations, a possible alternative to multilevel models that is often overlooked in the behavioral sciences. This article begins with a discussion of population-averaged and cluster-specific models, provides a brief overview of both multilevel models and generalized estimating equations, and then reports a simulation study of the sparse-data properties of generalized estimating equations, multilevel models, and single-level regression models for both normal and binary outcomes. The simulation found that generalized estimating equations estimated regression coefficients and their standard errors without bias with as few as 2 observations per cluster, provided that the number of clusters was reasonably large. Consistent with previous studies, multilevel models tended to overestimate the between-cluster variance components when the cluster size was below about 5.
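The contrast between a single-level regression and a generalized estimating equation can be illustrated with a small sketch. It is not the article's simulation design. It assumes a normal outcome, a random-intercept data-generating model, a cluster-level predictor, and an independence working correlation, under which the GEE point estimates coincide with ordinary least squares and only the standard error changes (to a cluster-robust sandwich estimate). The function names and all parameter values (2000 clusters of size 2, intraclass correlation 0.5) are hypothetical choices made for illustration.

```python
import math
import random


def simulate_sparse_clusters(n_clusters=2000, cluster_size=2, beta0=1.0,
                             beta1=0.5, tau=1.0, sigma=1.0, seed=1):
    """Return (cluster_id, x, y) rows from a random-intercept model.

    All parameter values are illustrative assumptions, not the article's design.
    """
    rng = random.Random(seed)
    rows = []
    for j in range(n_clusters):
        u = rng.gauss(0.0, tau)  # cluster-level random intercept
        x = rng.gauss(0.0, 1.0)  # cluster-level predictor, shared within cluster
        for _ in range(cluster_size):
            y = beta0 + beta1 * x + u + rng.gauss(0.0, sigma)
            rows.append((j, x, y))
    return rows


def fit_slope(rows):
    """Fit y = b0 + b1 * x and return (b1, naive SE, cluster-robust SE)."""
    n = len(rows)
    mean_x = sum(x for _, x, _ in rows) / n
    mean_y = sum(y for _, _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for _, x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for _, x, y in rows)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    # Single-level (naive) standard error: treats all rows as independent.
    rss = sum((y - b0 - b1 * x) ** 2 for _, x, y in rows)
    se_naive = math.sqrt(rss / (n - 2) / sxx)

    # Sandwich standard error: sum the slope score contributions within each
    # cluster before squaring, so within-cluster correlation is retained.
    scores = {}
    for j, x, y in rows:
        scores[j] = scores.get(j, 0.0) + (x - mean_x) * (y - b0 - b1 * x)
    se_robust = math.sqrt(sum(s ** 2 for s in scores.values())) / sxx
    return b1, se_naive, se_robust


slope, se_naive, se_robust = fit_slope(simulate_sparse_clusters())
print(f"slope={slope:.3f}  naive SE={se_naive:.4f}  robust SE={se_robust:.4f}")
```

Under these assumptions the naive standard error should come out smaller than the sandwich standard error, because the single-level model ignores the shared cluster intercept. Repeating the fit over many seeds and comparing the average standard error with the empirical standard deviation of the slope would be one way to check standard-error bias of the kind the abstract describes.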
