Scalable Multigrid-based Hierarchical Scientific Data Refactoring on GPUs

Rapid growth in scientific data and a widening gap between computational speed and I/O bandwidth make it increasingly infeasible to store and share all data produced by scientific simulations. Instead, we need methods for reducing data volumes: ideally, methods that can scale data volumes adaptively so as to enable negotiation of performance and fidelity tradeoffs in different situations. Multigrid-based hierarchical data representations hold promise as a solution to this problem, allowing for flexible conversion between different fidelities so that, for example, data can be created at high fidelity and then transferred or stored at lower fidelity via logically simple and mathematically sound operations. However, the effective use of such representations has been hindered until now by the relatively high costs of creating, accessing, reducing, and otherwise operating on them. We describe here highly optimized data refactoring kernels for GPU accelerators that enable efficient creation and manipulation of data in multigrid-based hierarchical forms. We demonstrate that our optimized design can achieve up to 264 TB/s aggregated data refactoring throughput (92% of theoretical peak) on 1024 nodes of the Summit supercomputer. We showcase our optimized design by applying it to a large-scale scientific visualization workflow and the MGARD lossy compression software.
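To illustrate the kind of hierarchical refactoring the abstract describes, the following is a minimal, simplified sketch of a one-level multigrid-style decomposition in 1D. It is not the actual MGARD or GPU kernel implementation; the function names and the choice of linear interpolation as the coarse-to-fine predictor are illustrative assumptions. The idea is that a fine grid splits losslessly into a half-resolution coarse copy plus correction coefficients, and the coefficients are small when the data is smooth, which is what makes reduced-fidelity storage and progressive recomposition possible.

```python
import numpy as np

def decompose(data):
    """One level of a simplified hierarchical (multigrid-style) decomposition.

    Splits `data` (length 2^k + 1) into a half-resolution coarse copy
    (the even-indexed samples) and correction coefficients: the residuals
    of the dropped odd-indexed samples against linear interpolation
    between their coarse neighbors.
    """
    coarse = data[::2]
    predicted = (data[:-2:2] + data[2::2]) / 2.0  # linear interpolation from coarse neighbors
    coeffs = data[1::2] - predicted               # small when the data is smooth
    return coarse, coeffs

def recompose(coarse, coeffs):
    """Invert decompose(): rebuild the fine grid exactly."""
    data = np.empty(2 * len(coarse) - 1)
    data[::2] = coarse
    data[1::2] = (coarse[:-1] + coarse[1:]) / 2.0 + coeffs
    return data

# Round trip on a smooth signal: decomposition is lossless, and the
# correction coefficients are much smaller than the data itself.
x = np.linspace(0.0, 1.0, 17)
data = np.sin(2.0 * np.pi * x)
coarse, coeffs = decompose(data)
assert np.allclose(recompose(coarse, coeffs), data)
```

Applying `decompose` recursively to the coarse copy yields the full hierarchy; transmitting only the coarser levels (or quantizing the small coefficients, as MGARD does) is what trades fidelity for volume.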
