Logistic Methods for Resource Selection Functions and Presence-Only Species Distribution Models

To better protect and conserve biodiversity, ecologists use machine learning and statistics to understand how species respond to their environment and to predict how they will respond to future climate change, habitat loss, and other threats. A fundamental modeling task is to estimate the probability that a given species is present in (or uses) a site, conditional on environmental variables such as precipitation and temperature. For a limited number of species, survey data consisting of both presence and absence records are available and can be used to fit a variety of conventional classification and regression models. For most species, however, the available data consist only of occurrence records, that is, locations where the species has been observed. In two closely related but separate bodies of ecological literature, diverse special-purpose models have been developed that contrast occurrence data with a random sample of available environmental conditions. The most widespread statistical approaches involve either fitting an exponential model of the species' conditional probability of presence or fitting a naive logistic model in which the random sample of available conditions is treated as absence data; both approaches have well-known drawbacks and do not necessarily produce valid probabilities. After summarizing existing methods, we overcome their drawbacks by introducing a new scaled binomial loss function for estimating an underlying logistic model of species presence/absence. Like the Expectation-Maximization approach of Ward et al. and the method of Steinberg and Cardell, our approach requires an estimate of population prevalence, Pr(y = 1), since prevalence is not identifiable from occurrence data alone. In contrast to these two methods, our loss function is straightforward to integrate into a variety of existing modeling frameworks, such as generalized linear and additive models and boosted regression trees. We also demonstrate that the approaches of Lele and Keim and of Lancaster and Imbens, which surmount the identifiability issue by making parametric data assumptions, do not typically produce valid probability estimates.
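
To make the role of prevalence concrete, the following minimal sketch works through the standard use-availability identity. It is an illustration only, not the scaled binomial loss introduced in the paper. It assumes n_presence occurrence records drawn from sites where the species is present and n_background records drawn at random from all available sites; the function names and the numbers are hypothetical. Under those assumptions, the naive logistic model targets q(x) = (n_presence p(x) / prevalence) / (n_presence p(x) / prevalence + n_background), where p(x) = Pr(y = 1 | x), so p(x) can be recovered from q(x) only if prevalence is supplied.

```python
import math


def naive_use_availability_prob(p, prevalence, n_presence, n_background):
    """Probability that a pooled record at x is an occurrence record,
    which is the quantity a naive presence-vs-background logistic model targets.

    p          : true Pr(y = 1 | x) at the site (assumed known here)
    prevalence : population prevalence Pr(y = 1)
    """
    ratio = n_presence * p / prevalence
    return ratio / (ratio + n_background)


def recover_presence_prob(q, prevalence, n_presence, n_background):
    """Invert the identity: map the naive probability q back to Pr(y = 1 | x),
    given an assumed prevalence."""
    odds = q / (1.0 - q)
    return prevalence * n_background / n_presence * odds


# Hypothetical numbers chosen only for illustration.
true_p, true_prev = 0.3, 0.2
q = naive_use_availability_prob(true_p, true_prev, n_presence=100, n_background=1000)

# With the correct prevalence the true probability is recovered ...
p_right = recover_presence_prob(q, 0.2, n_presence=100, n_background=1000)
# ... but the same q is equally consistent with another prevalence,
# which yields a different (and here wrong) probability.
p_wrong = recover_presence_prob(q, 0.4, n_presence=100, n_background=1000)

print(round(q, 4), round(p_right, 4), round(p_wrong, 4))
```

In this sketch the naive probability q is well below the true p, and doubling the assumed prevalence doubles the recovered probability. This is one way to see why occurrence data alone cannot pin down Pr(y = 1 | x) without an external prevalence estimate.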

[1]  Eliot J. B. McIntire,et al.  False negatives—A false problem in studies of habitat selection? , 2010 .

[2]  J Elith,et al.  A working guide to boosted regression trees. , 2008, The Journal of animal ecology.

[3]  Miroslav Dudík,et al.  Generative and Discriminative Learning with Unknown Labeling Bias , 2008, NIPS.

[4]  Miroslav Dudík,et al.  Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation , 2008 .

[5]  J. Friedman Greedy function approximation: A gradient boosting machine. , 2001 .

[6]  B. Manly,et al.  Resource selection by animals: statistical design and analysis for field studies. , 1994 .

[7]  Dan Steinberg,et al.  Estimating logistic regression models when the dependent variable has no variance , 1992 .

[8]  M. Austin Spatial prediction of species distribution: an interface between ecological theory and statistical modelling , 2002 .

[9]  H. Pulliam On the relationship between niche and distribution , 2000 .

[10]  Alberto Jiménez-Valverde,et al.  Not as good as they seem: the importance of concepts in species distribution modelling , 2008 .

[11]  Robert P. Anderson,et al.  Maximum entropy modeling of species geographic distributions , 2006 .

[12]  S. Cherry,et al.  Use and Interpretation of Logistic Regression in Habitat-Selection Studies , 2004 .

[13]  G. Imbens,et al.  Case-control studies with contaminated controls , 1996 .

[14]  Jason Matthiopoulos,et al.  The interpretation of habitat preference metrics under use–availability designs , 2010, Philosophical Transactions of the Royal Society B: Biological Sciences.

[15]  Michael Drielsma,et al.  Extended statistical approaches to modelling spatial pattern in biodiversity in northeast New South Wales. II. Community-level modelling , 2002, Biodiversity & Conservation.

[16]  Chris J. Johnson,et al.  Resource Selection Functions Based on Use–Availability Data: Theoretical Motivation and Evaluation Methods , 2006 .

[17]  S. Lele A New Method for Estimation of Resource Selection Probability Function , 2009 .

[18]  Subhash R Lele,et al.  Weighted distributions and estimation of resource selection probability functions. , 2006, Ecology.

[19]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm , 1977 .

[20]  A. Townsend Peterson,et al.  Novel methods improve prediction of species' distributions from occurrence data , 2006 .

[21]  T. Hastie,et al.  Presence‐Only Data and the EM Algorithm , 2009, Biometrics.

[22]  Peter Bühlmann,et al.  Boosting Algorithms: Regularization, Prediction and Model Fitting , 2007, arXiv:0804.2752.

[23]  Steven J. Phillips,et al.  Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. , 2009, Ecological applications : a publication of the Ecological Society of America.

[24]  A. Lehmann,et al.  Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns , 2002 .

[25]  S. Ferrier,et al.  Extended statistical approaches to modelling spatial pattern in biodiversity in northeast New South Wales. I. Species-level modelling , 2004, Biodiversity & Conservation.

[26]  Jennifer A. Miller,et al.  Mapping Species Distributions: Spatial Inference and Prediction , 2010 .