Ecological prediction at macroscales using big data: Does sampling design matter?

Although ecosystems respond to global change at regional to continental scales (i.e., macroscales), model predictions of ecosystem responses often rely on data from targeted monitoring of a small proportion of the ecosystems within a particular geographic area. In this study, we examined how the sampling strategy used to collect data for such models influences predictive performance. We subsampled a large, spatially extensive dataset to investigate how macroscale sampling strategy affects prediction of ecosystem characteristics in 6,784 lakes across a 1.8 million km² area. We estimated model predictive performance for different subsets of the dataset that mimic three common strategies for collecting observations of ecosystem characteristics: simple random sampling, stratified random sampling, and targeted sampling. We found that sampling strategy influenced model predictive performance in two ways: (1) stratified random sampling designs did not improve predictive performance compared to simple random sampling designs, and (2) although one of the scenarios that mimicked targeted (non-random) sampling produced the poorest-performing predictive models, the other targeted sampling scenarios produced models with predictive performance similar to that of the random sampling scenarios. Our results suggest that although potential biases in datasets from some forms of targeted sampling may limit predictive performance, compiling existing spatially extensive datasets can yield models with good predictive performance that may inform a wide range of science questions and policy goals related to global change.
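The subsampling idea described above can be illustrated with a minimal sketch. It is not the authors' code: the synthetic lake records, the field names (`region`, `area_ha`), the proportional allocation rule, and the choice of "largest lakes" as the targeted scenario are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def make_synthetic_lakes(n_lakes, n_regions, seed=0):
    """Build hypothetical lake records; real data would come from a lake database."""
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "region": rng.randrange(n_regions),     # assumed stratification variable
            "area_ha": rng.lognormvariate(3.0, 1.0),  # assumed targeting variable
        }
        for i in range(n_lakes)
    ]


def simple_random_sample(lakes, n, rng):
    """Every lake has the same chance of selection."""
    return rng.sample(lakes, n)


def stratified_random_sample(lakes, n, rng, key="region"):
    """Random sampling within each stratum, allocated in proportion to stratum size.

    Rounding means the total can differ slightly from n.
    """
    strata = defaultdict(list)
    for lake in lakes:
        strata[lake[key]].append(lake)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(lakes))
        k = min(len(members), max(1, share))  # at least one lake per stratum
        sample.extend(rng.sample(members, k))
    return sample


def targeted_sample(lakes, n, key="area_ha"):
    """Non-random selection: here, the n largest lakes (one possible targeting rule)."""
    return sorted(lakes, key=lambda lake: lake[key], reverse=True)[:n]
```

Each function returns a training subset. A model fitted to that subset could then be evaluated on the lakes left out, so that predictive performance can be compared across the three strategies.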
