An Experimental Investigation of Scalable Locality for Cluster Computing

Loop nest transformations have been used successfully to tune dense numerical codes for high performance on single- and multi-core shared-memory systems, but they have not been widely applied to cluster computing. We have explored the use of such transformations to produce the extremely high degree of memory locality needed to achieve high performance on a cluster running Intel's Cluster OpenMP software. Our experiments show high performance across our dedicated, homogeneous 56-core/14-node research cluster connected by gigabit Ethernet. With proper tuning, performance drops by less than a factor of two, and sometimes by only a few percent, when the network speed is reduced to 100 Mb/s. These results indicate that properly chosen compile-time optimizations can be effective for cluster computing, and they illustrate the importance of scalable locality, which may also be of interest to programmers developing cluster codes by hand.
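
To make "scalable locality" concrete, the sketch below shows, in C, the kind of loop nest transformation the abstract refers to: a 1-D Jacobi stencil written first in the naive time-step order, then rewritten with time skewing (ii = i + t) and tiling of the skewed dimension so that each tile is carried through every time step while it is cache-resident. This example is not taken from the paper; the problem sizes N and T, the tile width TS, and the function names are hypothetical, and the paper's generated code would additionally run such tiles as a pipelined wavefront across OpenMP threads and Cluster OpenMP nodes, which this sequential sketch omits.

    #include <stdio.h>

    /* Hypothetical sizes and tile width -- not taken from the paper. */
    #define N   4096   /* spatial grid points                         */
    #define T   1024   /* time steps                                  */
    #define TS  256    /* tile width in the skewed dimension ii = i+t */

    double grid[2][N];  /* ping-pong buffers: grid[t % 2] holds step t */

    /* Naive order: each time step streams the whole array through the
     * cache, so memory traffic grows with both N and T.               */
    void jacobi_naive(void)
    {
        for (int t = 0; t < T; t++) {
            const double *cur = grid[t % 2];
            double       *nxt = grid[(t + 1) % 2];
            for (int i = 1; i < N - 1; i++)
                nxt[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
        }
    }

    /* Time-skewed order: after the skew ii = i + t, every dependence
     * distance is non-negative, so tiles of ii may legally be carried
     * through the entire time range.  Each tile's working set stays
     * cache-resident for all T steps, so a value is loaded from main
     * memory roughly once per tile instead of once per time step.
     * (The paper's generated code also runs such tiles as a pipelined
     * wavefront across threads and nodes; this sketch is sequential.) */
    void jacobi_skewed(void)
    {
        for (int iit = 0; iit < N + T; iit += TS) {   /* skewed tiles */
            for (int t = 0; t < T; t++) {
                const double *cur = grid[t % 2];
                double       *nxt = grid[(t + 1) % 2];
                int lo = iit - t;                     /* i = ii - t   */
                int hi = iit + TS - t;
                if (lo < 1)     lo = 1;
                if (hi > N - 1) hi = N - 1;
                for (int i = lo; i < hi; i++)
                    nxt[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
            }
        }
    }

    int main(void)
    {
        /* Fixed boundary values must live in both buffers. */
        grid[0][0] = grid[1][0] = 1.0;
        grid[0][N - 1] = grid[1][N - 1] = 1.0;

        jacobi_skewed();        /* swap in jacobi_naive() to compare */
        printf("midpoint after %d steps: %f\n", T, grid[T % 2][N / 2]);
        return 0;
    }

This is only an illustration of the transformation style (time skewing and tiling for scalable locality) that the paper evaluates; the actual schedules, tile sizes, and parallelization used in the experiments differ and are produced by the tools cited in the paper.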
