Unrolling Loops Containing Task Parallelism

Classic loop unrolling improves the performance of sequential loops by reducing the overhead of the non-computational parts of the loop. Unfortunately, when a loop contains parallelism, most compilers either ignore the loop or perform a naive transformation. We propose extending the semantics of the loop unrolling transformation to cover loops that contain task parallelism. In these cases, the transformation tries to aggregate the multiple tasks that appear after a classic unrolling phase, reducing the per-iteration overhead. We present an implementation of this extended loop unrolling for OpenMP tasks with two phases: a classical unroll followed by task aggregation. Our aggregation technique covers the special cases where task parallelism appears inside branches or where the loop is uncountable. Our experimental results show that, for loops with fine-grained tasks, the extended unroll reduces the overhead of task creation and yields much better scaling.
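The two phases can be illustrated with a minimal hand-written sketch in C with OpenMP. It is not the authors' compiler output: the unroll factor of 2 and the names `process`, `loop_original`, `loop_aggregated`, and `versions_agree` are hypothetical and chosen only for illustration. The first function creates one task per iteration; the second shows what a classical unroll followed by aggregation of the two resulting tasks into one might look like, with a remainder loop for odd trip counts.

```c
#include <assert.h>

/* Hypothetical fine-grained work item (an assumption for this sketch). */
static void process(int *a, int i) { a[i] = 2 * i + 1; }

/* Original loop: one task is created per iteration. */
static void loop_original(int *a, int n) {
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < n; i++) {
            #pragma omp task firstprivate(i) shared(a)
            process(a, i);
        }
        #pragma omp taskwait
    }
}

/* Unrolled by 2, then aggregated: the two tasks that a plain unroll
 * would produce per unrolled iteration are merged into a single task,
 * so roughly half as many tasks are created. */
static void loop_aggregated(int *a, int n) {
    #pragma omp parallel
    #pragma omp single
    {
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            #pragma omp task firstprivate(i) shared(a)
            {
                process(a, i);
                process(a, i + 1);
            }
        }
        /* Remainder loop for iterations the unrolled loop did not cover. */
        for (; i < n; i++) {
            #pragma omp task firstprivate(i) shared(a)
            process(a, i);
        }
        #pragma omp taskwait
    }
}

/* Returns 1 if both versions fill the array identically (n <= 64). */
static int versions_agree(int n) {
    int a[64] = {0}, b[64] = {0};
    assert(n >= 0 && n <= 64);
    loop_original(a, n);
    loop_aggregated(b, n);
    for (int k = 0; k < n; k++)
        if (a[k] != b[k]) return 0;
    return 1;
}
```

Merging is only valid here because the iterations are independent; the cost is coarser tasks and therefore fewer opportunities for load balancing. Compiled without OpenMP support, the pragmas are ignored and both functions run sequentially with the same result.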
