Applying array contraction to a sequence of DOALL loops

Efficient program execution on multiprocessor computers requires both sufficient parallelism and good data locality. Recent research has shown that a combination of loop shifting, loop fusion, and array contraction can reduce the memory required to execute a sequence of serial loops and thereby improve cache locality. This paper studies how to extend such a memory-reduction scheme to a sequence of DOALL loops, which are executed in parallel on multiprocessors. Two methods are proposed to overcome the difficulties caused by loop-carried dependences: data copy-in removes anti-dependences between different parallel threads, and computation duplication removes flow dependences. Experiments on a number of benchmark programs show that the proposed technique improves both cache locality and parallel execution speed for the DOALL loops. The scheme achieves an average speedup of 1.41 for 17 programs on a 4-processor SUN machine.
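
The idea of fusing two DOALL loops, contracting the temporary array, and duplicating a boundary computation can be illustrated with a small sketch. This is not the paper's algorithm or code: the loop bodies, the function names (reference, fused_chunk, contracted), and the chunking scheme are all hypothetical, and the chunks run one after another here where a real implementation would assign each to its own thread.

```python
def reference(a):
    """Two separate DOALL loops communicating through a full-size temporary t."""
    n = len(a) - 1
    t = [a[i] + a[i + 1] for i in range(n)]          # loop 1
    return [t[i] + t[i - 1] for i in range(1, n)]    # loop 2


def fused_chunk(a, lo, hi):
    """Fused iterations lo..hi-1 with t contracted to the scalars prev and cur.

    t[lo-1] would normally be produced by the neighbouring chunk (a flow
    dependence between threads). Recomputing it locally, i.e. computation
    duplication, makes the chunk independent of its neighbour.
    """
    prev = a[lo - 1] + a[lo]       # duplicated computation of t[lo-1]
    out = []
    for i in range(lo, hi):
        cur = a[i] + a[i + 1]      # t[i], never stored in an array
        out.append(cur + prev)     # b[i] = t[i] + t[i-1]
        prev = cur
    return out


def contracted(a, num_chunks):
    """Split iterations 1..n-1 into chunks; each chunk could run on its own thread."""
    n = len(a) - 1
    bounds = [1 + k * (n - 1) // num_chunks for k in range(num_chunks + 1)]
    result = []
    for k in range(num_chunks):    # sequential stand-in for a parallel loop
        result.extend(fused_chunk(a, bounds[k], bounds[k + 1]))
    return result
```

In this sketch the temporary shrinks from n elements to two scalars per chunk, at the cost of one extra addition at each chunk boundary. Data copy-in for anti-dependences is not shown; it would apply if the second loop overwrote elements of a that a neighbouring chunk still needs to read.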
