OpenMP 4 Fortran Modernization of WSM6 for KNL

Parallel code portability in the petascale era requires modifying existing codes to support new architectures with large core counts and SIMD vector units. OpenMP is a well-established and increasingly supported vehicle for portable parallelization. As architectures mature and compiler OpenMP implementations evolve, best practices for code modernization change as well. In this paper, we examine the impact of newer OpenMP features (in particular OMP SIMD) on the Intel Xeon Phi Knights Landing (KNL) architecture, applied to optimizing loops in the single-moment 6-class microphysics module (WSM6) in the US Navy's NEPTUNE code. We find that with functioning OMP SIMD constructs, low thread invocation overhead on KNL, and a reduced penalty for unaligned access relative to previous architectures, OpenMP 4 can deliver reasonable scalability with relatively minor reorganization of a production physics code.
