ExaGeoStat: A High Performance Unified Framework for Geostatistics on Manycore Systems

We present ExaGeoStat, a high-performance framework for geospatial statistics in climate and environment modeling. In contrast to simulation based on partial differential equations derived from first-principles modeling, ExaGeoStat employs a statistical model based on the evaluation of the Gaussian log-likelihood function, which operates on a large dense covariance matrix. Generated by the parametrizable Matérn covariance function, the resulting matrix is symmetric and positive definite. Evaluating the Gaussian log-likelihood function becomes daunting as the number n of geographical locations grows, because it requires O(n^2) storage and O(n^3) operations. While many approximation methods have been devised from the side of statistical modeling to ameliorate these polynomial complexities, we are interested here in the complementary approach of evaluating the exact algebraic result by exploiting advances in solution algorithms and many-core computer architectures. Using state-of-the-art high-performance dense linear algebra libraries on various leading-edge parallel architectures (Intel KNLs, NVIDIA GPUs, and distributed-memory systems), ExaGeoStat advances the state of the art for statistical applications in climate and environmental science. ExaGeoStat provides a reference evaluation of statistical parameters against which the validity of the various approximation-based approaches can be assessed. The framework takes a first step toward merging large-scale data analytics and extreme computing for geospatial statistical applications, to be followed by additional complexity-reducing improvements from the solver side that can be implemented under the same interface. Thus, a single uncompromised statistical model can ultimately be executed in a wide variety of emerging exascale environments.
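
To make the cost structure concrete, the following is a minimal pure-Python sketch of an exact Gaussian log-likelihood evaluation through a Cholesky factorization. It is not ExaGeoStat's implementation or interface: all function names are hypothetical, the data are assumed to have zero mean, and the covariance is restricted to the Matérn special case with smoothness 1/2 (the exponential kernel), which avoids the Bessel function needed in the general case. The dense matrix accounts for the O(n^2) storage and the factorization for the O(n^3) operations.

```python
import math


def matern_half(dist, variance, length_scale):
    """Matern covariance for smoothness nu = 1/2 (exponential kernel).

    Assumption: the general Matern form needs a modified Bessel function;
    this special case keeps the sketch dependency-free.
    """
    return variance * math.exp(-dist / length_scale)


def covariance_matrix(locations, variance, length_scale):
    """Dense n x n covariance matrix: O(n^2) storage."""
    n = len(locations)
    return [
        [matern_half(math.dist(locations[i], locations[j]), variance, length_scale)
         for j in range(n)]
        for i in range(n)
    ]


def cholesky(a):
    """Lower-triangular factor L with a = L L^T: O(n^3) operations."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if diag <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            off = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = off / low[j][j]
    return low


def forward_solve(low, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i, row in enumerate(low):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y


def gaussian_log_likelihood(locations, z, variance, length_scale):
    """Exact zero-mean Gaussian log-likelihood of observations z."""
    n = len(z)
    low = cholesky(covariance_matrix(locations, variance, length_scale))
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))  # log|Sigma|
    y = forward_solve(low, z)
    quad = sum(v * v for v in y)  # z^T Sigma^{-1} z
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

In a parameter-estimation setting, a function like `gaussian_log_likelihood` would be called repeatedly by an optimizer over `variance` and `length_scale`, so each iteration pays the full factorization cost; at scale, a tuned dense linear algebra library would replace the hand-written loops shown here.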
