Asynchronous Task-Based Parallelization of Algebraic Multigrid

As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap while retaining convergence properties close to those of standard multiplicative AMG, but they remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using a hybrid of MPI and OmpSs (from the Barcelona Supercomputing Center) within a node, along with MPI for internode communication. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and electromagnetic diffusion, respectively. In time to solution for a full solve, the MPI+OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per-node advantage as both approaches weak-scale to thousands of cores, with MPI between nodes.
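To make the additive idea concrete, the following is a minimal sketch, not the Vassilevski and Yang algorithm and not OmpSs code. It assumes a toy 1D Laplacian `tridiag(-1, 2, -1)`, a plain two-level additive correction (damped Jacobi on the fine level plus a Galerkin coarse solve with linear interpolation), and a hypothetical damping factor `theta = 0.5` chosen so that the stationary iteration converges on this model problem. All function names are illustrative. The point is the dependency structure: both corrections read the same residual, so they can be submitted as independent tasks, whereas a multiplicative cycle would have to finish smoothing before restricting. Python threads are used here only to express the task graph, not to obtain a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def apply_A(x):
    """Return A x for A = tridiag(-1, 2, -1) with zero Dirichlet boundaries."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def solve_tridiag(off, diag, rhs):
    """Thomas algorithm for a constant tridiag(off, diag, off) system."""
    m = len(rhs)
    c = [0.0] * m
    d = [0.0] * m
    c[0] = off / diag
    d[0] = rhs[0] / diag
    for i in range(1, m):
        denom = diag - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def smoother_correction(r, omega=2.0 / 3.0):
    """Fine-level task: damped Jacobi, omega * D^{-1} r with D = 2 I."""
    return [omega * ri / 2.0 for ri in r]


def coarse_correction(r):
    """Coarse-level task: P * Ac^{-1} * P^T r with linear interpolation P."""
    n = len(r)
    m = (n - 1) // 2  # coarse point j sits at fine index 2j + 1
    rc = [0.5 * r[2 * j] + r[2 * j + 1] + 0.5 * r[2 * j + 2] for j in range(m)]
    # Galerkin coarse operator P^T A P equals tridiag(-0.5, 1, -0.5) here.
    ec = solve_tridiag(-0.5, 1.0, rc)
    e = [0.0] * n
    for j in range(m):
        e[2 * j] += 0.5 * ec[j]
        e[2 * j + 1] += ec[j]
        e[2 * j + 2] += 0.5 * ec[j]
    return e


def additive_two_level(b, iters=60, theta=0.5):
    """Stationary additive two-level iteration; returns (x, relative residual)."""
    norm = lambda v: math.sqrt(sum(t * t for t in v))
    x = [0.0] * len(b)
    r = list(b)
    r0 = norm(r)
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(iters):
            # Both tasks depend only on r, so neither waits for the other.
            fine = pool.submit(smoother_correction, r)
            coarse = pool.submit(coarse_correction, r)
            ef, ec = fine.result(), coarse.result()
            x = [xi + theta * (a + c) for xi, a, c in zip(x, ef, ec)]
            r = [bi - ai for bi, ai in zip(b, apply_A(x))]
    return x, norm(r) / r0


if __name__ == "__main__":
    x, rel = additive_two_level([1.0] * 63)
    print(f"relative residual after 60 iterations: {rel:.2e}")
```

The only synchronization point in each iteration is the sum of the two corrections. A task runtime can relax that further by letting levels proceed on residuals of different ages, which this sketch does not attempt.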

[1]  Stephen F. McCormick,et al.  Multilevel adaptive methods for partial differential equations , 1989, Frontiers in applied mathematics.

[2]  Gabriel Wittum,et al.  A massively parallel geometric multigrid solver on hierarchically distributed grids , 2013, Comput. Vis. Sci..

[3]  Panayot S. Vassilevski,et al.  Reducing communication in algebraic multigrid using additive variants , 2014, Numer. Linear Algebra Appl..

[4]  John Shalf,et al.  TiDA: High-Level Programming Abstractions for Data Locality Management , 2016, ISC.

[5]  Andreas Dedner,et al.  A generic grid interface for parallel and adaptive scientific computing. Part II: implementation and tests in DUNE , 2008, Computing.

[6]  Alexander Aiken,et al.  Legion: Expressing locality and independence with logical regions , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[7]  P. Wesseling A robust and efficient multigrid method , 1982 .

[8]  Jan Mandel,et al.  An algebraic theory for multigrid methods for variational problems , 1988 .

[9]  Pradeep Dubey,et al.  High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[10]  Jack Dongarra,et al.  Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .

[11]  George Ho,et al.  PAPI: A Portable Interface to Hardware Performance Counters , 1999 .

[12]  Hans De Sterck,et al.  Reducing Complexity in Parallel Algebraic Multigrid Preconditioners , 2004, SIAM J. Matrix Anal. Appl..

[13]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[14]  Panayot S. Vassilevski,et al.  Auxiliary Space AMG for H(curl) Problems , 2008 .

[15]  Hans De Sterck,et al.  Distance‐two interpolation for parallel algebraic multigrid , 2007, Numer. Linear Algebra Appl..

[16]  Rajeev Thakur,et al.  Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems , 2010, EuroMPI.

[17]  V. E. Henson,et al.  BoomerAMG: a parallel algebraic multigrid solver and preconditioner , 2002 .

[18]  Torsten Hoefler,et al.  MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory , 2013, Computing.

[19]  Ulrike Meier Yang,et al.  Parallel Algebraic Multigrid Methods — High Performance Preconditioners , 2006 .

[20]  Todd Gamblin,et al.  Scaling Algebraic Multigrid Solvers: On the Road to Exascale , 2010, CHPC.

[21]  Martin Schulz,et al.  Modeling the performance of an algebraic multigrid cycle on HPC platforms , 2011, ICS '11.

[22]  William F. Mitchell The Full Domain Partition Approach to Parallel Adaptive Refinement , 1999 .

[23]  John J. Cannon,et al.  The Magma Algebra System I: The User Language , 1997, J. Symb. Comput..

[24]  Jonathan J. Hu,et al.  ML 5.0 Smoothed Aggregation User's Guide , 2006 .

[25]  A. Brandt,et al.  Multi-level adaptive solutions to boundary-value problems , 1977, Math. Comput..

[26]  David E. Keyes,et al.  KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators , 2014, ACM Trans. Math. Softw..

[27]  Thomas A. Manteuffel,et al.  Adaptive Algebraic Multigrid , 2005, SIAM J. Sci. Comput..

[28]  Alejandro Duran,et al.  OmpSs: a Proposal for Programming Heterogeneous Multi-Core Architectures , 2011, Parallel Process. Lett..

[29]  Xavier Lacoste,et al.  Scheduling and memory optimizations for sparse direct solver on multi-core/multi-gpu cluster systems. (Ordonnancement et optimisations mémoire pour un solveur creux par méthodes directes sur des machines hétérogènes) , 2015 .