Dynamic Data Migration for Structured AMR Solvers

On cc-NUMA multiprocessors, the non-uniformity of main memory latencies motivates co-locating threads with the data they access. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement (AMR). The solver is parallelized using OpenMP, and the adaptive mesh refinement makes dynamic load balancing necessary. Because the runtime adaptation causes the memory access pattern to change dynamically, achieving a high degree of geographical locality is challenging. The main conclusions of the study are: (1) that geographical locality is very important for the performance of the solver, (2) that the performance can be improved significantly using dynamic page migration of misplaced data, (3) that a migrate-on-next-touch directive works well, whereas the first-touch strategy is less advantageous for programs exhibiting a dynamically changing memory access pattern, and (4) that the overhead for such migration is low compared to the total execution time.
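To make the first-touch strategy mentioned above concrete, the following is a minimal sketch in C with OpenMP and is not taken from the solver studied here. The function names `alloc_first_touch` and `sweep` are hypothetical. It assumes an operating system that places each page on the node of the thread that first writes to it, and it assumes threads are bound to nodes. The migrate-on-next-touch directive is platform-specific and is not shown.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles and initialize them in parallel.
   Under an assumed first-touch policy, each page ends up on the node of
   the thread that writes it first, so the static schedule used here
   determines the initial data placement. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        a[i] = 0.0;
    return a;
}

/* Hypothetical compute sweep: adds 1 to every element and returns the sum.
   It uses the same static schedule as the initialization, so each thread
   mostly works on the pages it touched first. */
double sweep(double *a, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < (long)n; ++i) {
        a[i] += 1.0;
        sum += a[i];
    }
    return sum;
}
```

The placement stays favorable only as long as the work distribution in `sweep` matches the one used during initialization. When load balancing reassigns parts of the mesh to other threads, as adaptive refinement requires, the pages remain where they were first touched, which is the situation that page migration is meant to address.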

[1]  John B. Bell,et al.  Parallelization of structured, hierarchical adaptive mesh refinement algorithms , 2000 .

[2]  Rainer Grauer,et al.  Racoon: A parallel mesh-adaptive framework for hyperbolic conservation laws , 2005, Parallel Comput..

[3]  Erik Hagersten Performance of PDE Solvers on a Self-Optimizing NUMA Architecture , 2003 .

[4]  Chris Johnson,et al.  Data Distribution , Migration and Replication on a cc-NUMA Architecture , 2002 .

[5]  K.M. Wilson,et al.  Dynamic Page Placement to Improve Locality in CC-NUMA Multiprocessors for TPC-C , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[6]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[7]  Jeffrey K. Hollingsworth,et al.  Using Hardware Counters to Automatically Improve Memory Performance , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[8]  Jarmo Rantakokko Partitioning strategies for structured multiblock grids , 2000, Parallel Comput..

[9]  Michael Thuné,et al.  A Comparison of Partitioning Schemes for Blockwise Parallel SAMR Algorithms , 2000, PARA.

[10]  Sverker Holmgren,et al.  affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system , 2005, ICS '05.

[11]  James C. Browne,et al.  Systems Engineering for High Performance Computing Software: The HDDA/DAGH Infrastructure for Implementation of Parallel Structured Adaptive Mesh , 2000 .

[12]  Jesús Labarta,et al.  Evaluation of the memory page migration influence in the system performance: the case of the SGI O2000 , 2003, ICS '03.

[13]  Dinshaw S. Balsara,et al.  Highly parallel structured adaptive mesh refinement using parallel language-based approaches , 2001, Parallel Comput..

[14]  Dirk Roose,et al.  DRAMA: A Library for Parallel Dynamic Load Balancing of Finite Element Applications , 1999, PPSC.

[15]  Jarmo Rantakokko,et al.  Comparison of Parallelization Models for Structured Adaptive Mesh Refinement , 2004, Euro-Par.

[16]  Tor Sørevik,et al.  Load balancing and OpenMP implementation of nested parallelism , 2005, Parallel Comput..

[17]  Patricia J. Teller Translation-lookaside buffer consistency , 1990, Computer.

[18]  Jonathan Harris,et al.  Extending OpenMP For NUMA Machines , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[19]  Sverker Holmgren,et al.  Improving Geographical Locality of Data for Shared Memory Implementations of PDE Solvers , 2004, International Conference on Computational Science.

[20]  Martin G. Everett,et al.  Parallel Dynamic Graph Partitioning for Adaptive Unstructured Meshes , 1997, J. Parallel Distributed Comput..

[21]  Ralf Deiterding,et al.  Construction and Application of an AMR Algorithm for Distributed Memory Computers , 2005 .

[22]  Dieter an Mey,et al.  Hybrid Parallelization with Dynamic Thread Balancing on a ccNUMA System , 2006 .

[23]  Scott R. Kohn,et al.  Large scale parallel structured AMR calculations using the SAMRAI framework , 2001, SC.

[24]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[25]  Vipin Kumar,et al.  A Unified Algorithm for Load-balancing Adaptive Scientific Simulations , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[26]  Per Lötstedt,et al.  Space–Time Adaptive Solution of First Order PDES , 2006, J. Sci. Comput..