Loop Striping: Maximize Parallelism for Nested Loops

The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative. Transformation techniques are generally applied to increase the parallelism of the nested loops in these applications. Most existing loop transformation techniques either cannot achieve maximum parallelism, or achieve it only at the cost of complicated loop-bound and loop-index calculations. This paper proposes a new technique, loop striping, that maximizes parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where a stripe is a group of iterations that are all independent of one another and can therefore be executed in parallel. Theorems and efficient algorithms are proposed for loop striping transformations. The experimental results show that loop striping always achieves a better iteration period than software pipelining and loop unfolding, reducing the average iteration period by 50% and 54%, respectively.
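The stripe definition above can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not spell out. It assumes a two-dimensional loop nest with uniform dependences given as distance vectors, and it checks direct dependences only (chains of dependences through iterations outside the group are ignored). The function name is_independent_group and the example dependence vectors are hypothetical.

```python
from itertools import combinations


def is_independent_group(iterations, dep_vectors):
    """Return True if no iteration in the group directly depends on another.

    iterations  -- list of (i, j) index pairs forming a candidate group
    dep_vectors -- uniform dependence distance vectors (di, dj), meaning
                   iteration (i + di, j + dj) depends on iteration (i, j)

    Assumption: only direct dependences are checked, which is a
    simplification of full independence.
    """
    deps = set(dep_vectors)
    for (i1, j1), (i2, j2) in combinations(iterations, 2):
        forward = (i2 - i1, j2 - j1)
        backward = (i1 - i2, j1 - j2)
        if forward in deps or backward in deps:
            return False
    return True


# Four consecutive iterations of row 0.
row = [(0, j) for j in range(4)]

# Every dependence crosses a row boundary, so the row is an independent group.
print(is_independent_group(row, [(1, 0), (1, 1)]))

# A dependence within the row makes neighbouring iterations dependent.
print(is_independent_group(row, [(0, 1)]))
```

In the first call no dependence vector connects two iterations of the same row, so the row qualifies as a group that could run in parallel; in the second call the vector (0, 1) links adjacent iterations and the check fails.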
