Performance Modeling of Gyrokinetic Toroidal Simulations for a Many-Tasking Runtime System

Conventional programming practices for multicore processors in high performance computing architectures are not universally effective in terms of efficiency and scalability for many algorithms in scientific computing. One possible way to improve the efficiency and scalability of applications on this class of machines is a many-tasking runtime system that employs many lightweight, concurrent threads. Yet how to estimate a priori the performance and scalability impact of such runtime systems on existing applications developed around the bulk synchronous parallel (BSP) model is not well understood. In this work, we present a case study of a BSP particle-in-cell benchmark code that has been ported to a many-tasking runtime system. The 3-D Gyrokinetic Toroidal code (GTC) is examined in its original MPI form and compared with a port to the High Performance ParalleX 3 (HPX-3) runtime system. Phase overlap, oversubscription behavior, and work rebalancing in the implementation are explored. Results for GTC obtained with the SST/macro simulator complement the implementation results. Finally, an analytic performance model for GTC is presented to guide future implementation efforts.
