DSM + SL =? SC (Or, Can Adding Scalable Locality to Distributed Shared Memory Yield SuperComputer Power?)

Distributed Shared Memory, such as that provided by Intel's Cluster OpenMP, lets programmers treat the combined memory systems of a cluster of workstations as a single large address space. This relieves the programmer of the burden of explicitly transferring data: a correct OpenMP program should still work with Cluster OpenMP. However, by hiding data transfers, such systems also hide a major performance factor: correct OpenMP programs with poor locality of reference become correct but intolerably slow Cluster OpenMP programs.

Scalable Locality describes the program property of locality that increases with problem size (just as Scalable Parallelism describes the property of parallelism that increases with problem size). In principle, the combination of an optimization that exposes scalable locality and a distributed shared memory system should yield a simple programming model with good performance on a cluster. We have begun to explore a combination of Cluster OpenMP and the Pluto research compiler's implementation of time tiling, which can produce parallel programs with scalable locality from sequential loop-based dense matrix codes.

In this article, we review our approach, discuss our performance model and its implications for tile size selection, and present our most recent experimental tests of the viability of our approach and the validity of our performance model.

Our performance model captures only machine-independent issues that are critical to setting tile size. It deduces lower bounds on tile dimensions from a combination of purely hardware parameters (e.g., memory bandwidth) and parameters describing the software without reference to any particular hardware (e.g., the number of live values produced by the loop nest). We also model load imbalance from OpenMP barriers, which is significant for smaller problems. Our results, while preliminary, are quite encouraging.
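
To make the idea of time tiling concrete, the following is a minimal sequential sketch, not Pluto's generated code and not the authors' implementation. It assumes a 1D three-point Jacobi stencil with fixed boundary values, and it stores the whole time-by-space grid for clarity, which a real code would avoid. The function names `jacobi_untiled` and `jacobi_time_skewed` and the parameter `width` are hypothetical, chosen for illustration. The tiled version skews the space index by the time step (`j = i + t`) and processes one band of `width` skewed columns through all time steps before moving to the next band.

```python
def jacobi_untiled(a0, steps):
    """Reference version: sweep the whole array once per time step."""
    n = len(a0)
    grid = [list(a0)] + [[0.0] * n for _ in range(steps)]
    for t in range(steps):
        # Boundary values are held fixed over time.
        grid[t + 1][0] = grid[t][0]
        grid[t + 1][n - 1] = grid[t][n - 1]
        for i in range(1, n - 1):
            grid[t + 1][i] = (grid[t][i - 1] + grid[t][i] + grid[t][i + 1]) / 3.0
    return grid[steps]


def jacobi_time_skewed(a0, steps, width):
    """Illustrative time-skewed version: tile the skewed index j = i + t.

    Iteration (t, i) reads values produced by (t-1, i-1), (t-1, i), (t-1, i+1),
    whose skewed indices are j-2, j-1, j. All of them lie in an earlier band or
    in the same band at an earlier t, so this execution order is legal.
    """
    n = len(a0)
    grid = [list(a0)] + [[0.0] * n for _ in range(steps)]
    # Fixed boundaries can be filled in up front for every time level.
    for t in range(1, steps + 1):
        grid[t][0] = a0[0]
        grid[t][n - 1] = a0[n - 1]
    last_j = (n - 2) + (steps - 1)  # largest skewed index of any interior update
    for j_base in range(1, last_j + 1, width):      # one band (tile) at a time
        for t in range(steps):                      # all time steps inside the band
            lo = max(j_base, 1 + t)                 # keep i = j - t >= 1
            hi = min(j_base + width - 1, n - 2 + t) # keep i = j - t <= n - 2
            for j in range(lo, hi + 1):
                i = j - t
                grid[t + 1][i] = (grid[t][i - 1] + grid[t][i] + grid[t][i + 1]) / 3.0
    return grid[steps]


if __name__ == "__main__":
    initial = [float(k % 7) for k in range(40)]
    same = jacobi_time_skewed(initial, 10, 8) == jacobi_untiled(initial, 10)
    print("tiled result matches untiled result:", same)
```

In this sketch each band performs on the order of `width * steps` updates while exchanging values with its neighbours only along the band edges, so the amount of computation per transferred value grows as more time steps are tiled together. That growth with problem size is the property the abstract calls scalable locality. Choosing `width` for a real machine is the tile size selection question addressed by the performance model.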
