Computationally efficient joint species distribution modeling of big spatial data

Abstract

Ongoing global change and increased interest in macroecological processes call for the analysis of spatially extensive data on species communities to understand and forecast distributional changes of biodiversity. Recently developed joint species distribution models can handle numerous species efficiently while explicitly accounting for spatial structure in the data. However, their applicability is generally limited to relatively small spatial data sets, because their computational cost grows steeply as the number of spatial locations increases. In this work, we propose a practical way to alleviate this scalability constraint for joint species modeling by exploiting two spatial‐statistics techniques that facilitate the analysis of large spatial data sets: the Gaussian predictive process and the nearest‐neighbor Gaussian process. We devise an efficient Gibbs posterior sampling algorithm for Bayesian model fitting that allows us to analyze community data sets consisting of hundreds of species sampled from up to hundreds of thousands of spatial units. The performance of these methods is demonstrated in a case study using an extensive plant data set of 30,955 spatial units. We provide an implementation of the presented methods as an extension to the hierarchical modeling of species communities framework.
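To illustrate why the Gaussian predictive process helps with large numbers of spatial units, the following is a minimal NumPy sketch of the low-rank idea only. It is not the paper's Gibbs sampler or its implementation. The exponential covariance function, the regular grid of knots, and the names `exp_cov` and `predictive_process_factor` are assumptions made for illustration.

```python
import numpy as np


def exp_cov(a, b, length_scale=1.0):
    """Exponential covariance between two sets of 2-D coordinates (assumed kernel)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-d / length_scale)


def predictive_process_factor(coords, knots, length_scale=1.0, jitter=1e-8):
    """Return W (n x m) such that W @ W.T approximates the n x n covariance.

    The approximation is K_nm K_mm^{-1} K_mn, where m is the number of knots.
    Only an m x m matrix is factorized, so the cost is driven by m, not n.
    """
    k_mm = exp_cov(knots, knots, length_scale) + jitter * np.eye(len(knots))
    k_nm = exp_cov(coords, knots, length_scale)
    chol = np.linalg.cholesky(k_mm)  # k_mm = chol @ chol.T
    # W = k_nm @ inv(chol).T, hence W @ W.T = k_nm @ inv(k_mm) @ k_nm.T
    return np.linalg.solve(chol, k_nm.T).T


# Hypothetical example: 500 sampling units, 25 knots on a regular grid.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(500, 2))
gx, gy = np.meshgrid(np.linspace(0.0, 10.0, 5), np.linspace(0.0, 10.0, 5))
knots = np.column_stack([gx.ravel(), gy.ravel()])

W = predictive_process_factor(coords, knots, length_scale=3.0)

# One draw of the approximate spatial random effect at all 500 units,
# obtained from only 25 standard normal variates.
eta = W @ rng.standard_normal(knots.shape[0])
```

In this sketch the spatial random effect at all units is a linear map of a small number of knot-level variables, so the only covariance matrix that is factorized has the dimension of the knot set. The nearest-neighbor Gaussian process mentioned in the abstract takes a different route (a sparse conditional structure rather than a low-rank one) and is not shown here.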
