Compiler Techniques for the Distribution of Data and Computation

This paper presents a new method that a parallelizing compiler can apply to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (LCG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes how the compiler extracts the locality information from nonannotated code and focuses on how the compiler derives the optimization problem, solves it, and generates the parallel code with the automatically selected iteration and data distributions. In addition, we discuss our model and the solutions (the decompositions) that it provides. The approach is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.
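To make the idea of choosing a decomposition by minimizing an overhead function on a graph more concrete, the following is a minimal toy sketch, not the formulation used in the paper. It assumes a hypothetical graph with three loop nests (`L1`, `L2`, `L3`), two candidate distributions per node, invented load-imbalance costs per node, and an invented communication cost paid on an edge whenever its two endpoints pick different distributions. It also replaces the MINLP solver with plain exhaustive enumeration, which is only feasible for very small graphs.

```python
from itertools import product

# Hypothetical toy data (not from the paper): three loop nests as graph nodes.
CHOICES = ("block", "cyclic")

# Assumed load-imbalance overhead of each node under each distribution.
imbalance = {
    "L1": {"block": 0.0, "cyclic": 2.0},
    "L2": {"block": 4.0, "cyclic": 1.0},
    "L3": {"block": 0.0, "cyclic": 2.0},
}

# Assumed communication overhead paid on an edge when its endpoints disagree.
redistribution = {("L1", "L2"): 1.0, ("L2", "L3"): 1.0}


def overhead(assign):
    """Total overhead of one assignment: load imbalance plus communication."""
    load = sum(imbalance[node][dist] for node, dist in assign.items())
    comm = sum(
        cost
        for (u, v), cost in redistribution.items()
        if assign[u] != assign[v]
    )
    return load + comm


def best_decomposition():
    """Enumerate every assignment and return one with minimal overhead."""
    nodes = sorted(imbalance)
    candidates = (
        dict(zip(nodes, combo))
        for combo in product(CHOICES, repeat=len(nodes))
    )
    return min(candidates, key=overhead)


if __name__ == "__main__":
    choice = best_decomposition()
    print(choice, overhead(choice))
```

With these invented costs, the cheapest assignment lets `L2` use a different distribution from its neighbors: the two redistributions it pays for cost less than the load imbalance it avoids. The same trade-off between communication and load imbalance is what the objective function described above is meant to capture, although the real model is nonlinear and is solved with a MINLP method rather than by enumeration.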
