Multigrid and Gauss-Seidel smoothers revisited: parallelization on chip multiprocessors

Efficient solution of partial differential equations requires a match between the algorithm and the target architecture. Many recent chip multiprocessors (CMPs, also known as multi-core processors) feature low inter-thread communication costs and smaller per-thread caches than previous shared-memory multiprocessor systems. From an algorithmic point of view, this means that data locality becomes more important than communication overhead, which may require a re-evaluation of many existing algorithms.

We have investigated parallel implementations of multigrid methods using a parallel, temporally blocked, naturally ordered smoother. Compared with the standard multigrid solution based on red-black ordering, we improve data locality, often by as much as a factor of ten, while our fine-grained locking scheme keeps the parallel efficiency high.

Our algorithm was initially inspired by CMPs, and it was surprising to see that our OpenMP multigrid implementation ran up to 40 percent faster than the standard red-black algorithm on a contemporary 8-way SMP system. Thanks to the temporal blocking, our smoother implementation often allowed us to apply the smoother twice at the same cost as a single application of a red-black smoother. By executing our smoother on a 32-thread UltraSPARC T1 (Niagara) SMT/CMP and a simulated 32-way CMP, we demonstrate that such architectures can tolerate the increased communication costs implied by the tradeoffs made in our implementation.
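
To make the contrast between the orderings concrete, the following is a minimal sequential sketch, not the authors' implementation. It assumes a 5-point discretization of the 2D Poisson problem on a square grid with fixed boundary values. All function names are hypothetical, and the OpenMP parallelization and fine-grained locking described above are deliberately left out. The fused routine shows one simple form of temporal blocking for a naturally ordered Gauss-Seidel smoother: the second sweep trails the first by one row, so each row is revisited while it is likely still in cache.

```python
def gs_row(u, f, h2, i):
    """Relax interior row i in place, left to right (natural ordering)."""
    for j in range(1, len(u[i]) - 1):
        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                          + u[i][j - 1] + u[i][j + 1] + h2 * f[i][j])


def two_natural_sweeps(u, f, h2):
    """Reference: two complete naturally ordered sweeps, one after the other."""
    for _ in range(2):
        for i in range(1, len(u) - 1):
            gs_row(u, f, h2, i)


def two_fused_sweeps(u, f, h2):
    """Assumed temporal blocking: sweep 2 follows sweep 1 one row behind.

    Sweep 2 on row i-1 only needs sweep-1 values from row i, so it can run
    as soon as sweep 1 has finished row i.
    """
    last = len(u) - 2                    # index of the last interior row
    for i in range(1, last + 1):
        gs_row(u, f, h2, i)              # sweep 1 on row i
        if i > 1:
            gs_row(u, f, h2, i - 1)      # sweep 2 on row i-1
    if last >= 1:
        gs_row(u, f, h2, last)           # sweep 2 on the final interior row


def red_black_sweep(u, f, h2):
    """One red-black sweep: all points of one colour, then all of the other."""
    for colour in (0, 1):
        for i in range(1, len(u) - 1):
            for j in range(1, len(u[i]) - 1):
                if (i + j) % 2 == colour:
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1]
                                      + h2 * f[i][j])
```

In this sketch the fused routine performs the same point updates in the same relative order as two separate sweeps, so the two produce identical results, while a red-black sweep must traverse the whole grid once per colour. Running the rows of the fused version in parallel would require synchronization between neighbouring rows, which is where a locking scheme of the kind mentioned above would come in.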
