Applying Nearest Neighbor Gaussian Processes to Massive Spatial Data Sets: Forest Canopy Height Prediction Across Tanana Valley Alaska

This manuscript addresses the need for forest scientists to overcome computational hurdles associated with analyzing massive spatial data sets and answering complex inferential questions about the underlying processes. The primary focus is on reparametrizations and alternate formulations of the recently proposed hierarchical Nearest Neighbor Gaussian Process (NNGP) models (Datta et al., 2016) for improved convergence, better run times, and more robust and reproducible Bayesian inference. Our specific application employs "Light Detection and Ranging" (LiDAR) data to deliver complete-coverage forest canopy height prediction maps with associated uncertainty estimates. A major hurdle is the very large number of spatial locations (on the order of a few million). We offer detailed algorithms to ensure efficient CPU memory management and exploit high-performance numerical linear algebra for executing the analysis. Our substantive data analytic contributions pertain to fully process-based posterior inference that accommodates incomplete coverage information from LiDAR instruments, information that is essential to advancing our understanding of forest structure and effectively monitoring forest resource dynamics over time. We assess the computational and inferential benefits of these alternate NNGP specifications using simulated data sets and LiDAR data collected over the US Forest Service Tanana Inventory Unit (TIU) in a remote portion of Interior Alaska. The resulting data product is the first statistically robust map of forest canopy for the TIU.