Fast and flexible Bayesian species distribution modelling using Gaussian processes

1. Species distribution modelling (SDM) is widely used in ecology, and predictions of species distributions inform both policy and ecological debates. Therefore, methods with high predictive accuracy and those that enable biological interpretation are preferable. Gaussian processes (GPs) are a highly flexible approach to statistical modelling and have recently been proposed for SDM. GP models fit smooth but potentially complex response functions that can account for high-dimensional interactions between predictors. We propose fitting GP SDMs using deterministic numerical approximations, rather than Markov chain Monte Carlo methods, to make GPs more computationally efficient and easier to use.

2. We introduce GP models and their application to SDM, illustrate how ecological knowledge can be incorporated into GP SDMs via Bayesian priors, and formulate a simple GP SDM that can be fitted efficiently. This model can be fitted either by learning the hyperparameters or by using a fixed approximation to them. Using a subset of the North American Breeding Bird Survey data set, we compare the out-of-sample predictive accuracy of these models with several commonly used SDM approaches for both presence/absence and presence-only data.

3. Predictive accuracy of GP SDMs fitted by Laplace approximation was greater than that of boosted regression trees, generalized additive models (GAMs) and logistic regression when trained on presence/absence data, and greater than that of all of these models plus MaxEnt when trained on presence-only data. GP SDMs fitted using a fixed approximation to hyperparameters were no less accurate than those with MAP estimation and were on average 70 times faster, equivalent in speed to GAMs.

4. As well as having strong predictive power for this data set, GP SDMs offer a convenient method for incorporating prior knowledge of the species' ecology. Because these models are fitted using efficient numerical approximations, they may easily be applied to large data sets and run automatically for many species. An R package, GRaF, is provided to enable SDM users to fit GP models.
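
To make the fitting approach concrete, the following is a minimal sketch of a Laplace approximation for a GP classifier on presence/absence data with fixed hyperparameters. It is not the GRaF implementation. The squared-exponential kernel, the zero prior mean, the logistic link, the probit-style correction in the predictive step, and all function names (`sq_exp_kernel`, `fit_laplace`, `predict_laplace`) are assumptions made for illustration, and the single-covariate data set is synthetic.

```python
import numpy as np


def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_laplace(X, y, lengthscale=1.0, variance=1.0, max_iter=50, tol=1e-8):
    """Find the posterior mode of the latent function by Newton iteration.

    Hyperparameters are held fixed (assumed values, not learned).
    y holds 1 for presence and 0 for absence.
    """
    n = len(y)
    K = sq_exp_kernel(X, X, lengthscale, variance)
    f = np.zeros(n)  # zero prior mean (assumption)
    for _ in range(max_iter):
        p = sigmoid(f)
        W = p * (1.0 - p)          # negative Hessian of the log-likelihood
        sW = np.sqrt(W)
        B = np.eye(n) + sW[:, None] * K * sW[None, :]
        L = np.linalg.cholesky(B)  # B is well conditioned (eigenvalues >= 1)
        b = W * f + (y - p)
        a = b - sW * np.linalg.solve(L.T, np.linalg.solve(L, sW * (K @ b)))
        f_new = K @ a
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    # Recompute the quantities needed for prediction at the final mode.
    p = sigmoid(f)
    sW = np.sqrt(p * (1.0 - p))
    L = np.linalg.cholesky(np.eye(n) + sW[:, None] * K * sW[None, :])
    return {"X": X, "y": y, "p": p, "sW": sW, "L": L,
            "lengthscale": lengthscale, "variance": variance}


def predict_laplace(model, X_new):
    """Approximate probability of presence at new covariate values."""
    k_star = sq_exp_kernel(X_new, model["X"],
                           model["lengthscale"], model["variance"])
    mean = k_star @ (model["y"] - model["p"])
    v = np.linalg.solve(model["L"], model["sW"][:, None] * k_star.T)
    var = np.maximum(model["variance"] - (v ** 2).sum(axis=0), 0.0)
    # Probit-style approximation to the logistic-Gaussian integral.
    return sigmoid(mean / np.sqrt(1.0 + np.pi * var / 8.0))


# Synthetic example: one covariate, species present near an optimum at 0.
X_train = np.linspace(-3.0, 3.0, 61)[:, None]
y_train = (np.abs(X_train[:, 0]) < 1.0).astype(float)
model = fit_laplace(X_train, y_train, lengthscale=1.0, variance=1.0)
probs = predict_laplace(model, np.array([[0.0], [2.5]]))
```

Holding `lengthscale` and `variance` fixed mirrors the faster of the two fitting options described in point 2; learning them would wrap `fit_laplace` in an optimiser over the approximate marginal likelihood, which is omitted here.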
