Parallel implementation of WRF double moment 5-class cloud microphysics scheme on multiple GPUs

The Weather Research and Forecasting (WRF) Double-Moment 5-class (WDM5) mixed-ice microphysics scheme predicts the mixing ratios of hydrometeors and the number concentrations of the warm-rain species, including cloud and rain. WDM5 can be computed in parallel across the horizontal domain on many-core GPUs. To obtain better GPU performance, we manually rewrite the original WDM5 Fortran module as a highly parallel CUDA C program. We explore the use of coalesced memory access and asynchronous data transfer. Our GPU-based WDM5 module scales to multiple GPUs. With one NVIDIA Tesla K40 GPU, the optimized scheme achieves a speedup of 252x over its Fortran counterpart running on one CPU core of an Intel Xeon E5-2603, whereas one CPU socket (4 cores) achieves a speedup of only 4.2x over one CPU core. With two NVIDIA Tesla K40 GPUs, the speedup over one CPU core rises to 468x.
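
The three techniques named above (horizontal-domain parallelism, coalesced memory access, and asynchronous transfer across several GPUs) can be illustrated with a minimal sketch. It is not the WDM5 code: the kernel `column_update`, the helper `idx3`, the (i, k, j) array layout with i varying fastest, the grid sizes, and the placeholder tendency are all assumptions made for illustration. Each thread handles one horizontal grid column and loops over the vertical levels, so neighbouring threads touch neighbouring addresses; the domain is split along j, and each GPU works on its slab through its own stream.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                    cudaGetErrorString(err_), __LINE__);          \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Flat index of an (i, k, j) field stored with i varying fastest (assumed layout).
__host__ __device__ inline size_t idx3(int i, int k, int j, int nx, int nz) {
    return ((size_t)j * nz + k) * nx + i;
}

// Hypothetical stand-in for the column physics: one thread per (i, j) column.
// Consecutive threads differ in i, so each load/store in the k loop is coalesced.
__global__ void column_update(float *q, int nx, int nz, int ny_local, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y;
    if (i >= nx || j >= ny_local) return;
    for (int k = 0; k < nz; ++k) {
        size_t n = idx3(i, k, j, nx, nz);
        q[n] = q[n] * (1.0f - 0.1f * dt);  // placeholder tendency, not WDM5
    }
}

int main() {
    const int nx = 256, nz = 32, ny = 64;  // illustrative grid sizes
    const float dt = 1.0f;
    int ndev = 0;
    CUDA_CHECK(cudaGetDeviceCount(&ndev));
    if (ndev < 1) { fprintf(stderr, "no CUDA device found\n"); return 1; }

    // Pinned host memory is required for copies that are truly asynchronous.
    const size_t total = (size_t)nx * nz * ny;
    float *h_q = nullptr;
    CUDA_CHECK(cudaHostAlloc((void **)&h_q, total * sizeof(float),
                             cudaHostAllocPortable));
    for (size_t n = 0; n < total; ++n) h_q[n] = 1.0f;

    std::vector<float *> d_q(ndev, nullptr);
    std::vector<cudaStream_t> stream(ndev);

    // Queue copy-in, kernel, and copy-out on every GPU without waiting.
    for (int d = 0; d < ndev; ++d) {
        int j0 = (int)((long long)d * ny / ndev);
        int j1 = (int)((long long)(d + 1) * ny / ndev);
        int ny_local = j1 - j0;
        CUDA_CHECK(cudaSetDevice(d));
        CUDA_CHECK(cudaStreamCreate(&stream[d]));
        if (ny_local == 0) continue;  // more GPUs than rows
        size_t off = idx3(0, 0, j0, nx, nz);  // slab is contiguous in j
        size_t bytes = (size_t)ny_local * nz * nx * sizeof(float);
        CUDA_CHECK(cudaMalloc((void **)&d_q[d], bytes));
        CUDA_CHECK(cudaMemcpyAsync(d_q[d], h_q + off, bytes,
                                   cudaMemcpyHostToDevice, stream[d]));
        dim3 block(128);
        dim3 grid((nx + block.x - 1) / block.x, ny_local);
        column_update<<<grid, block, 0, stream[d]>>>(d_q[d], nx, nz, ny_local, dt);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h_q + off, d_q[d], bytes,
                                   cudaMemcpyDeviceToHost, stream[d]));
    }

    // Wait for every GPU to finish, then release its resources.
    for (int d = 0; d < ndev; ++d) {
        CUDA_CHECK(cudaSetDevice(d));
        CUDA_CHECK(cudaStreamSynchronize(stream[d]));
        CUDA_CHECK(cudaStreamDestroy(stream[d]));
        CUDA_CHECK(cudaFree(d_q[d]));
    }

    // Every value should now be close to 1.0 * (1 - 0.1 * dt).
    float max_err = 0.0f;
    for (size_t n = 0; n < total; ++n)
        max_err = fmaxf(max_err, fabsf(h_q[n] - 0.9f));
    printf("GPUs used: %d, max abs error: %g\n", ndev, max_err);
    CUDA_CHECK(cudaFreeHost(h_q));
    return max_err < 1e-5f ? 0 : 1;
}
```

Splitting along j keeps each GPU's slab contiguous in memory, so a single copy per direction suffices. In a real port, the slab would likely be subdivided further across several streams per GPU so that transfers for one chunk overlap with computation on another.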
