Performance modelling for parallel PDE solvers on NUMA-systems

Systems with a large shared memory are becoming increasingly popular in parallel computer architecture. A shared memory system can be either a uniform memory architecture (UMA) or a cache coherent non-uniform memory architecture (cc-NUMA). In the present thesis, the performance of parallel PDE solvers on cc-NUMA computers is studied. In particular, we consider the shared namespace programming model, represented by OpenMP. Since the main memory is physically, or geographically, distributed over several multi-processor nodes, the latency for local memory accesses is lower than for remote accesses. Therefore, the geographical locality of the data becomes important. The focus of the present thesis is on multithreaded PDE solvers on cc-NUMA systems, in particular on their memory access patterns with respect to geographical locality. The questions posed are: (1) How large is the influence of the non-uniformity of the memory system on performance? (2) How should a program be written in order to reduce this influence? (3) Is it possible to introduce optimizations in the computer system for this purpose? The main conclusion is that geographical locality is important for performance on cc-NUMA systems. This is shown experimentally for a broad range of PDE solvers, as well as theoretically using a model involving characteristics of computer systems and applications. Geographical locality can be achieved through migration directives that are inserted by the programmer or, possibly in the future, automatically by the compiler. On some systems, it can also be accomplished by means of transparent, hardware-initiated migration and replication. However, a necessary condition for migration to be effective is that the memory access pattern must not be "speckled", i.e. as few threads as possible should access each memory page. We also conclude that OpenMP is competitive with MPI on cc-NUMA systems if care is taken to get a favourable data distribution.
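The "speckled" condition can be made concrete with a small counting experiment. The following Python sketch is only an illustration under stated assumptions and is not taken from the thesis: the function names, the array size, the thread count and the page size (1024 elements, i.e. an 8 KiB page of 8-byte values) are all hypothetical. It counts how many distinct threads touch each memory page when the elements of a one-dimensional array are assigned to threads either in contiguous blocks (as with OpenMP `schedule(static)`) or round-robin (as with `schedule(static, 1)`).

```python
def threads_per_page(n_elements, n_threads, elems_per_page, owner):
    """Return, for each page, the number of distinct threads that touch it."""
    n_pages = (n_elements + elems_per_page - 1) // elems_per_page
    touching = [set() for _ in range(n_pages)]
    for i in range(n_elements):
        touching[i // elems_per_page].add(owner(i, n_elements, n_threads))
    return [len(s) for s in touching]


def block_owner(i, n, p):
    # Contiguous chunks of roughly n/p elements per thread,
    # comparable to OpenMP schedule(static).
    chunk = (n + p - 1) // p
    return i // chunk


def cyclic_owner(i, n, p):
    # Round-robin assignment, comparable to OpenMP schedule(static, 1).
    return i % p


# Hypothetical parameters, chosen only for illustration.
N_ELEMENTS = 8192
N_THREADS = 4
ELEMS_PER_PAGE = 1024  # assumed: 8 KiB page holding 8-byte values

block = threads_per_page(N_ELEMENTS, N_THREADS, ELEMS_PER_PAGE, block_owner)
cyclic = threads_per_page(N_ELEMENTS, N_THREADS, ELEMS_PER_PAGE, cyclic_owner)

print("block :", block)
print("cyclic:", cyclic)
```

With these assumed parameters, every page is touched by exactly one thread under the block assignment, so a page can be placed in or migrated to that thread's node. Under the cyclic assignment every page is touched by all four threads, so no placement of the page can make the accesses local for more than one of them.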
