Performance Optimization Strategies for WRF Physics Schemes Used in Weather Modeling

Performance optimization in the petascale era has required, and in the exascale era will continue to require, modifications of legacy codes to take advantage of new architectures with large core counts and SIMD units. High-level and low-level optimization strategies are applied to the Numerical Weather Prediction (NWP) physics codes considered here: the WRF Single-Moment 6-Class Microphysics Scheme (WSM6) and the Global Forecast System (GFS) physics and radiation codes used in the NEPTUNE forecast code. Building on previous work optimizing WSM6 on the Intel Knights Landing (KNL), it is shown how to further optimize WSM6, GFS physics, and GFS radiation on Intel KNL, Haswell, and potentially on future micro-architectures with many cores and SIMD vector units. The optimization techniques employ thread-local structures of arrays (SOA), the OpenMP directive OMP SIMD, and minor code transformations to enable better utilization of SIMD units, increase parallelism, improve locality, and reduce memory traffic. The optimized versions of WSM6, GFS physics, and GFS radiation run 70×, 27×, and 23× faster (respectively) on KNL and 26×, 18×, and 30× faster (respectively) on Haswell than their respective original serial versions. Although this work targets WRF physics schemes, the findings are transferable to other performance optimization contexts and provide insight into the optimization of codes with complex physical models for present and near-future architectures with many cores and vector units.
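The combination of thread-local structures of arrays and OMP SIMD described above can be illustrated with a minimal C sketch. It is not the WSM6 or GFS code: the function `toy_physics_update`, the fields `t` and `q`, the chunk width `CHUNK`, and the update formula are all hypothetical stand-ins chosen for illustration. Each thread copies a small chunk of points into its own private SOA buffer, runs a vectorizable inner loop over that buffer, and writes the results back.

```c
#include <assert.h>

/* Hypothetical chunk width; in practice it would be tuned to the SIMD
   width and cache size of the target (e.g. KNL or Haswell). */
#define CHUNK 16

/* Hypothetical thread-local structure of arrays: one small contiguous
   array per physical field, so the SIMD loop reads unit-stride data. */
typedef struct {
    float t[CHUNK];  /* stand-in for a temperature-like field */
    float q[CHUNK];  /* stand-in for a moisture-like field    */
} chunk_soa;

/* Toy update q_out = q_in + 0.5 * t_in. This is a placeholder for a real
   physics kernel, not the scheme used in WSM6 or GFS. */
void toy_physics_update(const float *t_in, const float *q_in,
                        float *q_out, int n)
{
    assert(n >= 0);

    #pragma omp parallel
    {
        chunk_soa local;  /* declared inside the region: private per thread */

        /* Chunks are distributed across threads. */
        #pragma omp for schedule(static)
        for (int base = 0; base < n; base += CHUNK) {
            int len = (n - base < CHUNK) ? (n - base) : CHUNK;

            /* Copy the chunk into the thread-local SOA buffer. */
            for (int i = 0; i < len; ++i) {
                local.t[i] = t_in[base + i];
                local.q[i] = q_in[base + i];
            }

            /* Ask the compiler to vectorize the inner loop. */
            #pragma omp simd
            for (int i = 0; i < len; ++i) {
                local.q[i] = local.q[i] + 0.5f * local.t[i];
            }

            /* Write the results back to the shared output array. */
            for (int i = 0; i < len; ++i) {
                q_out[base + i] = local.q[i];
            }
        }
    }
}
```

Built with an OpenMP-enabled compiler (for example with `-fopenmp` on GCC), the outer loop runs across threads and the inner loop is a candidate for SIMD execution; without OpenMP the pragmas are ignored and the function runs serially with the same result. The small private buffer is meant to keep each thread's working set cache-resident, which is the locality and memory-traffic benefit the abstract attributes to thread-local SOA.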
