Parallel 3D deterministic particle transport on Intel MIC architecture

Single-node computation speed is essential in large-scale parallel solutions of particle transport problems. The Intel Many Integrated Core (MIC) architecture supports more than 200 hardware threads as well as 512-bit vector operations on double-precision floating-point data. In this paper, we use the native model of MIC to parallelize the simulation of one-energy-group, time-independent, deterministic discrete ordinates particle transport in 3D Cartesian geometry (Sweep3D). The implementation uses both the hardware threads and the vector units of MIC to exploit the multi-level parallelism of the discrete ordinates method efficiently while keeping good data locality. Our optimized implementation is verified on the target MIC and provides a speedup of up to 1.99 times over the original MPI code on an Intel Xeon E5-2660 CPU when flux fixup is off. Compared with the prior implementation on an NVIDIA Tesla M2050 GPU, a speedup of up to 1.23 times is obtained. The differences between the MIC and GPU implementations are also discussed.
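The parallelism that a sweep exposes can be illustrated with a small sketch. It is not the paper's implementation: the cell update below is a hypothetical toy formula (not the Sweep3D diamond-difference kernel), and the names `update_cell`, `sweep_sequential`, `wavefronts`, and `sweep_wavefront` are assumptions introduced here. For one angle in one octant, each cell depends only on its three upstream neighbours, so all cells on a diagonal plane i + j + k = d are independent of one another. It is this kind of independence that work such as the above can spread across threads and vector lanes.

```python
def update_cell(psi, cell, source, inflow):
    """Toy upwind update (hypothetical formula, not the Sweep3D kernel)."""
    i, j, k = cell
    upstream = (psi.get((i - 1, j, k), inflow)
                + psi.get((i, j - 1, k), inflow)
                + psi.get((i, j, k - 1), inflow))
    return (source + upstream) / 4.0


def sweep_sequential(nx, ny, nz, source, inflow=0.0):
    """Reference sweep: visit cells in plain nested-loop order."""
    psi = {}
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                psi[(i, j, k)] = update_cell(psi, (i, j, k), source, inflow)
    return psi


def wavefronts(nx, ny, nz):
    """Yield the diagonal planes i + j + k = d in sweep order."""
    for d in range(nx + ny + nz - 2):
        yield [(i, j, d - i - j)
               for i in range(nx)
               for j in range(ny)
               if 0 <= d - i - j < nz]


def sweep_wavefront(nx, ny, nz, source, inflow=0.0):
    """Sweep plane by plane; cells within a plane never read each other."""
    psi = {}
    for plane in wavefronts(nx, ny, nz):
        # Every value is computed from earlier planes only, so this loop
        # could be distributed over threads or vector lanes.
        new_values = {cell: update_cell(psi, cell, source, inflow)
                      for cell in plane}
        psi.update(new_values)
    return psi


if __name__ == "__main__":
    a = sweep_sequential(4, 3, 5, source=1.0)
    b = sweep_wavefront(4, 3, 5, source=1.0)
    print("orders agree:", a == b)
```

Because the wavefront version computes a whole plane before storing any of its values, agreement with the sequential order shows that no cell in a plane reads another cell of the same plane.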
