Optimizing parallelism for nested loops with iterational and instructional retiming

Embedded systems have strict timing and code size requirements. Retiming is one of the most important optimization techniques for improving the execution time of loops, because it increases the parallelism among successive loop iterations. Traditionally, retiming has been applied at the instruction level to reduce the cycle period of single loops. Multi-dimensional (MD) retiming can exploit outer-loop parallelism, but the loop transformation it requires introduces large overheads in loop index generation and code size. In this paper, we propose a novel approach that combines iterational retiming with instructional retiming to satisfy any given timing constraint, achieving full parallelism for the iterations in a partition with minimal code size. Experimental results show that combining iterational and instructional retiming reduces code size by 37% compared with applying iterational retiming alone.
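To make the notion of instruction-level retiming on a single loop concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It models a loop body as a data-flow graph whose edges carry delays (inter-iteration distances) and computes the cycle period, that is, the longest chain of zero-delay dependencies. The graph, the node times, the helper names `retime` and `cycle_period`, and the retiming values are all hypothetical. The sign convention assumed here is that an edge u->v with d delays becomes d + r[v] - r[u]; other papers use the opposite sign.

```python
from functools import lru_cache


def retime(edges, r):
    """Return new edge delays after applying retiming r.

    Assumed convention: edge (u, v) with d delays gets d + r[v] - r[u].
    A retiming is legal only if no edge ends up with a negative delay.
    """
    retimed = {}
    for (u, v), d in edges.items():
        new_d = d + r[v] - r[u]
        if new_d < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} has delay {new_d}")
        retimed[(u, v)] = new_d
    return retimed


def cycle_period(times, edges):
    """Longest total computation time along any path of zero-delay edges.

    Assumes the zero-delay subgraph is acyclic, as in any well-formed loop.
    """
    succ = {u: [] for u in times}
    for (u, v), d in edges.items():
        if d == 0:
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        return times[u] + max((longest_from(v) for v in succ[u]), default=0)

    return max(longest_from(u) for u in times)


# Hypothetical loop body: A -> B -> C within one iteration,
# and C feeds A two iterations later.
times = {"A": 1, "B": 1, "C": 1}
edges = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 2}

before = cycle_period(times, edges)

# Hypothetical retiming that moves one delay from C->A onto A->B.
r = {"A": 0, "B": 1, "C": 1}
retimed_edges = retime(edges, r)
after = cycle_period(times, retimed_edges)

print(before, after)
```

In this toy graph the retiming shortens the longest zero-delay chain from three nodes to two, which is the sense in which retiming exposes parallelism among successive iterations. The total delay around the cycle is unchanged, as it must be for any retiming.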
