Computational Benefit of GPU Optimization for the Atmospheric Chemistry Modeling

Global chemistry‐climate models become increasingly expensive to run as their chemical mechanisms grow more complex and realistic. Optimization for graphics processing units (GPUs) may make longer global simulations with regional detail possible, but few studies have explored the potential benefit for atmospheric chemistry modeling. In this study, the second‐order Rosenbrock solver of the chemistry module of CAM4‐Chem is ported to the GPU to gauge the potential speed‐up. On the CPU, the fastest performance is achieved using the Intel compiler with a block interleaved memory layout; different combinations of compiler and memory layout lead to a ~11.02× difference in computational time. In contrast, the GPU version performs best when combining a fully interleaved memory layout with a block size equal to the warp size, CUDA streams for independent kernels, and constant memory. The most efficient data transfer between CPU and GPU is obtained by allocating the memory contiguously during data initialization on the GPU. Compared to one CPU core, the speed‐up from one GPU reaches ~11.7× for the computation alone and ~3.82× when the data transfer between CPU and GPU is included. Using one GPU alone is also generally faster than both the multithreaded implementation on the 16 CPU cores of a compute node and the single‐source solution (OpenACC). The best performance is achieved by the hybrid CPU/GPU version, but the workload must be rescheduled among the CPU cores before this version can be used in practical CAM4‐Chem simulations.
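The memory layouts named above can be illustrated with a minimal sketch of how element i of system s might be addressed in a flat array when many small, independent chemistry systems (for example, one per grid cell) are solved as a batch. The function names, the batch sizes, and the exact index formulas are assumptions chosen for illustration; they follow the usual definitions of contiguous, fully interleaved, and block interleaved storage and are not taken from the CAM4‐Chem source.

```python
# Illustrative index mappings for a batch of small independent systems.
# All names and sizes here are hypothetical, not CAM4-Chem identifiers.

def offset_contiguous(s, i, n):
    """Each system's n elements are stored back to back."""
    return s * n + i

def offset_fully_interleaved(s, i, num_systems):
    """Element i of every system is stored together, system index fastest."""
    return i * num_systems + s

def offset_block_interleaved(s, i, n, block):
    """Systems are grouped into blocks; interleaving happens inside a block.

    Assumes the number of systems is a multiple of `block`.
    """
    b, lane = divmod(s, block)
    return b * n * block + i * block + lane

if __name__ == "__main__":
    n, num_systems, block = 3, 8, 4
    for s in range(num_systems):
        row = [offset_block_interleaved(s, i, n, block) for i in range(n)]
        print(f"system {s}: offsets {row}")
```

In the fully interleaved mapping, neighbouring systems s and s+1 place the same element i at adjacent addresses, which is the property that lets consecutive GPU threads, each handling one system, read adjacent memory locations. The block interleaved mapping reduces to the contiguous one for a block of 1 and to the fully interleaved one when the block spans the whole batch.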
