Logistic regression for clustered data from environmental monitoring programs

Large-scale surveys, such as national forest inventories and vegetation monitoring programs, usually have complex sampling designs that include geographical stratification and units organized in clusters. When models are developed using data from such programs, a key question is whether to use design information when analyzing the relationship between a response variable and a set of covariates. Standard statistical regression methods often fail to account for complex sampling designs, which may lead to severely biased estimators of model coefficients. Furthermore, ignoring that data are spatially correlated within clusters may lead to underestimated standard errors of regression coefficient estimates, with a risk of drawing wrong conclusions. We first review general approaches that account for complex sampling designs, e.g. methods using probability weighting, and stress the need to explore the effects of the sampling design when applying logistic regression models. We then use Monte Carlo simulation to compare the performance of the standard logistic regression model with two approaches for modeling correlated binary responses: cluster-specific and population-averaged logistic regression models. As an example, we analyze the occurrence of epiphytic hair lichens in the genus Bryoria, an indicator of forest ecosystem integrity. Based on data from the National Forest Inventory (NFI) for the period 1993–2014, we generated a data set on hair lichen occurrence on more than 100,000 Picea abies trees distributed throughout Sweden. The NFI data included ten covariates representing forest structure and climate variables potentially affecting lichen occurrence. Our analyses show the importance of taking complex sampling designs and correlated binary responses into account in logistic regression modeling, to avoid notably biased parameter estimators and standard errors and erroneous interpretations of the factors affecting, for example, hair lichen occurrence. We recommend comparisons of unweighted and weighted logistic regression analyses as an essential step in the development of models based on data from large-scale surveys.
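The claim that ignoring within-cluster correlation leads to underestimated standard errors can be illustrated with a small Monte Carlo sketch. This is not the simulation design used in the paper: the function names, the number and size of clusters, the intercept and the random-effect standard deviation are all assumptions chosen for illustration. It uses the simplest possible case, an intercept-only model in which the estimated occurrence probability is the overall proportion, and compares a naive standard error that treats all trees as independent with a cluster-based standard error computed from the cluster means.

```python
import math
import random
import statistics


def simulate_clusters(rng, n_clusters, cluster_size, intercept, sigma_u):
    """Simulate binary responses with a shared random intercept per cluster."""
    data = []
    for _ in range(n_clusters):
        u = rng.gauss(0.0, sigma_u)  # cluster effect shared by all units in the cluster
        p = 1.0 / (1.0 + math.exp(-(intercept + u)))
        data.append([1 if rng.random() < p else 0 for _ in range(cluster_size)])
    return data


def naive_and_cluster_se(data):
    """Return the overall proportion, its naive SE, and a cluster-based SE."""
    n_clusters = len(data)
    n_units = sum(len(cluster) for cluster in data)
    p_hat = sum(sum(cluster) for cluster in data) / n_units
    # Naive SE: treats every unit as an independent Bernoulli draw.
    naive_se = math.sqrt(p_hat * (1.0 - p_hat) / n_units)
    # Cluster-based SE: treats cluster means as the independent observations
    # (valid here because all clusters have the same size).
    cluster_means = [sum(cluster) / len(cluster) for cluster in data]
    cluster_se = statistics.stdev(cluster_means) / math.sqrt(n_clusters)
    return p_hat, naive_se, cluster_se


def run_simulation(n_reps=500, n_clusters=50, cluster_size=10,
                   intercept=-0.5, sigma_u=1.0, seed=1):
    """Compare average reported SEs with the empirical SD of the estimates."""
    rng = random.Random(seed)
    estimates, naive_ses, cluster_ses = [], [], []
    for _ in range(n_reps):
        data = simulate_clusters(rng, n_clusters, cluster_size, intercept, sigma_u)
        p_hat, naive_se, cluster_se = naive_and_cluster_se(data)
        estimates.append(p_hat)
        naive_ses.append(naive_se)
        cluster_ses.append(cluster_se)
    return {
        "empirical_sd": statistics.stdev(estimates),
        "mean_naive_se": statistics.mean(naive_ses),
        "mean_cluster_se": statistics.mean(cluster_ses),
    }


result = run_simulation()
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed settings, the average naive standard error should fall clearly below the empirical standard deviation of the estimates across replicates, while the cluster-based standard error should track it closely. Models with covariates, such as the cluster-specific and population-averaged logistic regressions compared in the paper, require dedicated software, but the same mechanism applies.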
