Tasks Unlimited: Lightweight Task Offloading Exploiting MPI Wait Times for Parallel Adaptive Mesh Refinement

Load balancing dynamically adaptive mesh refinement (AMR) codes is inherently difficult: codes have to balance both computational workload and memory footprint over meshes that can change at any time, while modern supercomputers and their interconnects increasingly exhibit fluctuating performance. We propose a novel lightweight scheme for MPI+X which complements traditional balancing. It is a reactive diffusion approach which uses online measurements of MPI idle time to migrate tasks from overloaded to underemployed ranks. Tasks are migrated only temporarily: they are deployed to ranks which would otherwise wait, processed there with high priority, and their results are made available to the overloaded ranks again. Our approach hijacks idle time to do meaningful work and is fully non-blocking, asynchronous and distributed, without a global data view. Tests with a seismic simulation code running an explicit high-order ADER-DG scheme (developed in the ExaHyPE engine) demonstrate the method's potential. We found speed-ups of up to a factor of 2-3 for ill-balanced scenarios without logical modifications of the code base.
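The reactive diffusion idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `diffusion_step`, the parameters `threshold` and `step`, and the quota matrix are hypothetical and chosen only for illustration. The sketch assumes that each rank has measured how long it waited in MPI during the last time step, that the rank with the smallest wait time is the one everybody else waits for, and that this rank should hand a few more tasks per step to the rank that waited longest. For brevity it operates on a plain list of wait times, whereas the abstract states that the actual scheme works without a global data view.

```python
def diffusion_step(wait_times, offload_quota, threshold=0.1, step=1):
    """Return updated offload quotas after one reactive diffusion step.

    wait_times[r]         -- measured MPI wait time of rank r (seconds), assumed input
    offload_quota[s][d]   -- tasks rank s offloads to rank d per time step
    threshold, step       -- hypothetical tuning parameters of this sketch
    """
    ranks = range(len(wait_times))
    # The rank that waits least is treated as the overloaded (critical) rank.
    critical = min(ranks, key=lambda r: wait_times[r])
    # The rank that waits most is treated as the best offloading target.
    idlest = max(ranks, key=lambda r: wait_times[r])

    # Copy the quotas so that the caller's data is not modified in place.
    new_quota = [row[:] for row in offload_quota]

    # Only react if the imbalance is significant; otherwise leave quotas alone.
    if critical != idlest and wait_times[idlest] - wait_times[critical] > threshold:
        new_quota[critical][idlest] += step
    return new_quota


# Usage: rank 0 never waits, rank 1 waits longest, so rank 0 offloads to rank 1.
quota = [[0] * 3 for _ in range(3)]
quota = diffusion_step([0.0, 0.5, 0.2], quota)
```

Increasing the quota by a small fixed amount per step, rather than rebalancing everything at once, is what makes such a scheme diffusive: repeated steps gradually move work towards idle ranks and can follow performance fluctuations as they appear.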
