Using Intel Xeon Phi to accelerate the WRF TEMF planetary boundary layer scheme

The Weather Research and Forecasting (WRF) model is designed for numerical weather prediction and atmospheric research. The WRF software infrastructure consists of several components, such as dynamic solvers and physics schemes. Numerical models resolve the large-scale flow, while subgrid-scale parameterizations estimate small-scale properties (e.g., boundary layer turbulence and convection, clouds, radiation). These small-scale properties have a significant influence on the resolved scale because of the complex nonlinear nature of the atmosphere. For the cloudy planetary boundary layer (PBL), it is fundamental to parameterize vertical turbulent fluxes and subgrid-scale condensation in a realistic manner. A parameterization based on the Total Energy – Mass Flux (TEMF) approach, which unifies the turbulence and moist convection components, produces better results than the other PBL schemes. For that reason, we chose the TEMF scheme as the PBL scheme to optimize for the Intel Many Integrated Core (MIC) architecture, which ushers in a new era of supercomputing speed, performance, and compatibility. It allows developers to run code at trillions of calculations per second using a familiar programming model. In this paper, we present our optimization results for the TEMF planetary boundary layer scheme. The optimizations we performed were quite generic in nature. They included vectorization of the code to utilize the vector units inside each CPU. Furthermore, memory access was improved by scalarizing some of the intermediate arrays. The results show that the optimizations improved MIC performance by 14.8x. They also increased CPU performance by 2.6x compared to the original multi-threaded code on a quad-core Intel Xeon E5-2603 running at 1.8 GHz. Compared to the optimized code running on a single CPU socket, the optimized MIC code is 6.2x faster.
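As a rough illustration of what scalarizing an intermediate array can look like, the sketch below contrasts a loop pair that passes its intermediate result through a temporary array with a fused loop that keeps it in a loop-local scalar. It is not taken from the TEMF code: WRF is written in Fortran, C is used here only for illustration, and the function names, the arithmetic, and the array layout are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical baseline: the intermediate product is stored in a full
   temporary array, so each element is written to memory in the first
   loop and read back in the second. */
void flux_with_temp_array(size_t n, const float *a, const float *b,
                          float *tmp, float *out)
{
    for (size_t i = 0; i < n; ++i)
        tmp[i] = a[i] * b[i];
    for (size_t i = 0; i < n; ++i)
        out[i] = tmp[i] + a[i];
}

/* Hypothetical scalarized version: the intermediate lives in a
   loop-local scalar, which removes the temporary array's memory
   traffic. The restrict qualifiers promise the compiler that the
   arrays do not overlap, which makes the single loop easier to
   auto-vectorize. */
void flux_scalarized(size_t n, const float *restrict a,
                     const float *restrict b, float *restrict out)
{
    for (size_t i = 0; i < n; ++i) {
        const float t = a[i] * b[i];  /* was tmp[i] */
        out[i] = t + a[i];
    }
}
```

Both functions compute the same values. Whether the second form is actually vectorized, and how much it gains, depends on the compiler, its flags, and the target hardware.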
