Compiler optimization of implicit reductions for distributed memory multiprocessors

This paper presents reduction recognition and parallel code generation strategies for distributed-memory multiprocessors. We describe techniques to recognize a broad range of implicit reduction operations, including those that involve statements at multiple loop nesting levels and those intermixed with conditional control flow. We introduce two new optimizations: factoring, which increases data locality for SUM and PRODUCT reductions, and index encoding, which enables a single global communication to accomplish both an extreme-value reduction and an extreme-value location reduction. We have implemented these techniques in the dHPF compiler for High Performance Fortran (HPF). We evaluate their effectiveness experimentally by compiling several reduction benchmarks with dHPF and with two commercial HPF compilers, and by comparing the performance of the generated code on an IBM SP2. Our results show that our recognition techniques are more powerful than those of the commercial compilers, and that our index encoding and factoring optimizations can improve performance by a factor of two where they apply.
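The idea behind index encoding can be illustrated with a small sketch. The abstract does not give the encoding dHPF actually uses, so the code below is an assumption: each processor packs its local maximum and that element's global index into one comparable pair, and a single MAX reduction over those pairs returns both the extreme value and its location. The function names `local_maxloc` and `global_maxloc` are hypothetical, and `functools.reduce` stands in for the one global communication step.

```python
from functools import reduce


def local_maxloc(block, offset):
    """Encode one processor's local maximum as (value, -global_index).

    Negating the index makes the pair with the smaller index compare as
    larger when values tie, so a MAX reduction keeps the first occurrence.
    Assumes the block is non-empty.
    """
    best_i = max(range(len(block)), key=lambda i: (block[i], -i))
    return (block[best_i], -(offset + best_i))


def global_maxloc(blocks):
    """Simulate one global MAX reduction over the encoded pairs.

    `blocks` is a list of per-processor chunks of a block-distributed array
    (an illustrative stand-in for the distributed data).
    """
    encoded = []
    offset = 0
    for block in blocks:
        encoded.append(local_maxloc(block, offset))
        offset += len(block)
    # One reduction yields both the extreme value and its location.
    value, neg_index = reduce(max, encoded)
    return value, -neg_index


if __name__ == "__main__":
    # Three simulated processors holding pieces of one array.
    print(global_maxloc([[3, 9, 1], [9, 4], [2, 8, 7]]))
```

Without such an encoding, the value and its location would typically be obtained by two separate global reductions; combining them into one comparable item is what lets a single communication suffice in this sketch.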
