Ultra-Scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2

In this work, an ultra-scalable algorithm is designed and optimized to accelerate a 3D compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2. We first reformulate the mesoscale model to avoid long-latency operations, and then employ carefully designed inter-node and intra-node domain decomposition algorithms to achieve balanced utilization of the different computing units. Communication-computation overlap and concurrent data transfer methods are used to reduce the cost of data movement at scale. A variety of optimization techniques on both the CPU side and the accelerator side are exploited to enhance the in-socket performance. The proposed hybrid algorithm successfully scales to 6,144 Tianhe-2 nodes with nearly ideal weak-scaling efficiency, and achieves over 8 percent of the peak performance in double precision. This ultra-scalable hybrid algorithm may be of interest to the community for accelerating atmospheric models on the increasingly dominant heterogeneous supercomputers.
