Compiling affine loop nests for distributed-memory parallel architectures

We present new techniques for the compilation of arbitrarily nested loops with affine dependences for distributed-memory parallel architectures. Our framework is implemented as a source-level transformer that uses the polyhedral model and generates parallel code with communication expressed using the Message Passing Interface (MPI). Compared to previous approaches, ours is a significant advance in (1) the generality of the input code handled, (2) the efficiency of the generated communication code, or both. We provide experimental results on a cluster of multicores demonstrating its effectiveness. In some cases, the code we generate outperforms manually parallelized code, and in another case it is within 25% of it. To the best of our knowledge, this is the first work reporting end-to-end, fully automatic distributed-memory parallelization and code generation for input programs and transformation techniques as general as those we allow.
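To make the communication problem concrete, the following is a minimal hand-worked sketch, not the paper's actual algorithm: a 1D Jacobi-style stencil block-partitioned across hypothetical processes, where each process must receive the boundary ("flow-out") values of its neighbors before computing its owned iterations. A polyhedral framework derives such send/receive sets automatically via integer-set analysis and emits the corresponding MPI calls; here the exchange is simulated in plain Python so the partitioned result can be checked against a sequential reference.

```python
# Illustrative only: simulate the halo exchange that generated
# distributed-memory code would perform for an affine stencil.

N = 16            # problem size (hypothetical)
P = 4             # number of simulated processes (hypothetical)
tile = N // P     # block size per process

def owner_range(p):
    # Block distribution: process p owns iterations [p*tile, (p+1)*tile)
    return range(p * tile, (p + 1) * tile)

# Global data, kept here only for checking; each process would hold
# just its own block plus one halo cell on each side.
B = [float(i) for i in range(N)]

def jacobi_seq(B):
    # Sequential reference: A[i] = (B[i-1] + B[i] + B[i+1]) / 3
    return [(B[i-1] + B[i] + B[i+1]) / 3 for i in range(1, N - 1)]

def jacobi_dist(B):
    # Each process starts with only the values it owns.
    local = {p: {i: B[i] for i in owner_range(p)} for p in range(P)}
    # "Flow-out" sets: each process sends its boundary elements to the
    # neighbors whose computations read them (the communication an MPI
    # send/receive pair would carry).
    for p in range(P):
        lo, hi = p * tile, (p + 1) * tile - 1
        if p > 0:            # first owned element flows out to the left
            local[p - 1][lo] = B[lo]
        if p < P - 1:        # last owned element flows out to the right
            local[p + 1][hi] = B[hi]
    # After the exchange, every process computes its owned iterations
    # independently, reading only locally available values.
    A = {}
    for p in range(P):
        for i in owner_range(p):
            if 1 <= i <= N - 2:
                d = local[p]
                A[i] = (d[i - 1] + d[i] + d[i + 1]) / 3
    return [A[i] for i in range(1, N - 1)]

# The distributed computation must agree with the sequential one.
assert jacobi_dist(B) == jacobi_seq(B)
```

For this stencil the communication set per process boundary is a single element; for general affine dependences the framework computes these sets symbolically, which is what allows it to handle arbitrarily nested loops rather than fixed stencil patterns.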
