Three points to consider when choosing a LM or GLM test for count data

Summary

The two most common approaches to analysing count data are to fit a generalized linear model (GLM), or to transform the data and fit a linear model (LM). The latter has recently been advocated as controlling type I error rates more reliably in tests for no association, while seemingly losing little power. We make three points on this issue.

Point 1: Choice of statistical model should be made primarily on the grounds of data properties. Choice of testing procedure should be addressed as a separate issue, after model choice. If a model with the appropriate data properties nonetheless has statistical problems, such as poor type I error control (i.e. a type I error rate that greatly exceeds the intended significance level), the best solution is to keep the model and fix the problems.

Point 2: When a test has problems with type I error control, it can usually be corrected, but this may require departing from software default approaches. In particular, resampling is a good solution for small samples and can be easy to implement.

Point 3: Tests based on models that better fit the data (e.g. a negative binomial for overdispersed count data) tend to have better power properties, and in some instances have considerably higher power.

We illustrate these issues for a 2 × 2 experiment with a count response. This seemingly simple problem becomes hard when the experimental design is unbalanced, and software default procedures using either LMs or GLMs can have difficulties, although in both cases the issues can be fixed. We conclude that, when GLMs are thought to fit count data well, and when any necessary steps are taken to correct type I error rates, they should be used rather than LMs. Nonetheless, standard LM tests are often robust and can have good type I error control, so there is an argument for using them for counts when diagnostics are difficult and statistical models are complex, although at some risk of losing power and interpretability.
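The resampling idea in Point 2 can be sketched in a few lines. The example below is a simplified illustration and not the procedure used in the paper: it assumes a single two-level factor rather than the full 2 × 2 design, uses a Poisson likelihood-ratio statistic (a negative binomial would need an iterative fit), and runs on made-up, unbalanced data. The function names `poisson_lr_stat` and `permutation_p_value` are hypothetical.

```python
import math
import random


def poisson_lr_stat(counts, labels):
    """Likelihood-ratio statistic: separate Poisson mean per group vs. one common mean."""
    def term(values):
        total = sum(values)
        # total * log(mean); an all-zero group contributes 0 (limit of x * log x)
        return total * math.log(total / len(values)) if total > 0 else 0.0

    groups = {}
    for y, g in zip(counts, labels):
        groups.setdefault(g, []).append(y)
    return 2.0 * (sum(term(v) for v in groups.values()) - term(counts))


def permutation_p_value(counts, labels, n_perm=999, seed=1):
    """Compare the observed statistic with its distribution under shuffled labels."""
    rng = random.Random(seed)
    observed = poisson_lr_stat(counts, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # under "no association", labels are exchangeable
        if poisson_lr_stat(counts, shuffled) >= observed - 1e-12:
            exceed += 1
    # Include the observed arrangement so the p-value is never exactly zero
    return (exceed + 1) / (n_perm + 1)


# Made-up, unbalanced illustration data (assumption, not from the paper)
control = [0, 2, 1, 3, 0, 1, 2, 1]
treated = [4, 7, 3]
counts = control + treated
labels = ["control"] * len(control) + ["treated"] * len(treated)
p_value = permutation_p_value(counts, labels)
```

Here the test statistic still comes from the count model, while the reference distribution comes from resampling rather than from a large-sample chi-squared approximation, which is the part that tends to fail in small or unbalanced samples. Testing an interaction in a 2 × 2 design needs more care about what is resampled (for example residuals or a parametric bootstrap under the null model), which this sketch does not attempt.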
