Loop fusion for clustered VLIW architectures

Embedded systems require maximum performance from a processor within significant constraints on power consumption and chip cost. Using software pipelining, high-performance digital signal processors can often exploit considerable instruction-level parallelism (ILP) and thus significantly improve performance. In some instances, however, software pipelining hinders the goals of low power consumption and low chip cost. Specifically, the registers required by a software-pipelined loop may exceed the size of the physical register set.

The register pressure incurred by software pipelining makes it difficult to build a high-performance embedded processor with a single, multi-ported register bank that has enough registers to support high levels of ILP while maintaining clock speed and limiting power consumption. The large number of ports required to support a single register bank severely hampers access time. The port requirement can be reduced in hardware by partitioning the register bank into multiple banks, each connected to a disjoint subset of the functional units, called a cluster. Since a functional unit is not directly connected to all register banks, energy and resources can be wasted through the delays incurred when accessing "non-local" registers.

The overhead due to partitioning the register set can be ameliorated by high-level compiler loop optimizations such as unrolling, unroll-and-jam and fusion. These optimizations expose data-independent parallelism that can be spread across clusters, often without requiring "non-local" register accesses, and they can provide work to hide the latency of any such accesses that are needed.

In this paper, we examine the effects of loop fusion on DSP loops run on four simulated, clustered VLIW architectures and the Texas Instruments TMS320C64x. Our experiments show harmonic mean speedups ranging from 1.3 to 2.
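As a rough illustration of the transformation studied here, the sketch below shows loop fusion on two small, data-independent loops. The kernels, function names and constants are hypothetical and are not taken from the paper's benchmarks; the sketch also assumes the four arrays do not overlap, which is what makes the fusion legal. After fusion, the loop body contains two independent operation chains, which a clustered VLIW compiler could in principle assign to different clusters without copying values between register banks.

```c
#include <stddef.h>

/* Unfused form: two separate loops over the same index range.
 * Each loop body holds a single short dependence chain. */
void kernels_unfused(const int *a, const int *b, int *x, int *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = a[i] * 3;
    for (size_t i = 0; i < n; i++)
        y[i] = b[i] + 7;
}

/* Fused form: one loop whose body holds both chains.
 * Assumes a, b, x and y do not alias, so the statements are
 * data-independent and the results match the unfused version. */
void kernels_fused(const int *a, const int *b, int *x, int *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = a[i] * 3;   /* chain 1: could be placed on one cluster */
        y[i] = b[i] + 7;   /* chain 2: could be placed on another cluster */
    }
}
```

Both functions compute the same results for non-overlapping arrays; the difference is only in how much independent work each loop iteration offers to the scheduler.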
