Asynchronous and synchronous models of executions on Intel® Xeon Phi™ coprocessor systems for high performance of long wave radiation calculations in atmosphere models

Long-wave radiation calculations are among the most time-consuming calculations in atmosphere modeling. In this work, we explore two models for executing these calculations on Intel® Xeon Phi™ coprocessor systems. In the asynchronous model, we offload the radiation calculations to the coprocessors and execute them there concurrently with the other atmosphere model calculations on the CPU cores. In the synchronous model, the CPU cores, after offloading, wait for the results and use them in the same time step. We developed various techniques to complete these synchronous executions in minimal time, including loop rearrangement and low-cost interpolations. Using experiments on an Intel Xeon Phi cluster, we show that our asynchronous execution model results in savings of many months in wall-clock execution time for multi-century climate simulations. Our synchronous execution model results in performance improvements of up to 70% in long-wave radiation calculations.

Asynchronous and synchronous execution models for radiation calculations on Intel Xeon Phi.
Techniques including loop rearrangement and early placement, and low-cost substitutions.
Performance improvement of up to 45% due to our asynchronous model.
Wall-clock time savings of up to 2.25 years for multi-century climate simulations.
Performance improvement of up to 70% due to our synchronous model.
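The difference between the two execution models described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: a thread pool stands in for the coprocessor offload, the functions `radiation` and `other_physics` are hypothetical placeholders, and the assumption that the asynchronous model consumes the radiation result one time step late is inferred from the contrast with the synchronous model rather than stated in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor


def radiation(state):
    # Hypothetical stand-in for the long-wave radiation kernel
    # that would be offloaded to the coprocessor.
    return 0.1 * state


def other_physics(state):
    # Hypothetical stand-in for the remaining atmosphere-model
    # calculations that stay on the CPU cores.
    return state + 1.0


def run_synchronous(state, steps, pool):
    for _ in range(steps):
        future = pool.submit(radiation, state)   # offload
        heating = future.result()                # CPU waits for the result
        state = other_physics(state) - heating   # result used in the same step
    return state


def run_asynchronous(state, steps, pool):
    heating = 0.0  # assumption: no radiation result exists before the first step
    for _ in range(steps):
        future = pool.submit(radiation, state)   # offload without waiting
        state = other_physics(state) - heating   # CPU work overlaps the offload
        heating = future.result()                # collected for the next step
    return state


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(run_synchronous(10.0, 2, pool))
        print(run_asynchronous(10.0, 2, pool))
```

In the synchronous loop the offloaded time sits on the critical path of every step, which is why shortening it matters there. In the asynchronous loop the offload is hidden behind the CPU work, at the cost (under the assumption above) of using a radiation result computed from the previous step's state.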
