Paper Information: Ultra-Scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2

In this work, an ultra-scalable algorithm is designed and optimized to accelerate a 3D compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2. We first reformulate the mesoscale model to avoid long-latency operations, and then employ carefully designed inter-node and intra-node domain decomposition algorithms to achieve balanced utilization of the different computing units. Communication-computation overlap and concurrent data transfer methods are used to reduce the cost of data movement at scale. A variety of optimization techniques on both the CPU side and the accelerator side are exploited to enhance in-socket performance. The proposed hybrid algorithm successfully scales to 6,144 Tianhe-2 nodes with nearly ideal weak scaling efficiency, and achieves over 8 percent of the peak performance in double precision. This ultra-scalable hybrid algorithm may be of interest to the community for accelerating atmospheric models on increasingly dominant heterogeneous supercomputers.
