Optimizing loop performance for clustered VLIW architectures

Modern embedded systems often require high degrees of instruction-level parallelism (ILP) within strict constraints on power consumption and chip cost. Unfortunately, a high-performance embedded processor with high ILP generally places large demands on register resources, making it difficult to maintain a single, multi-ported register bank. To address this problem, some architectures, e.g. the Texas Instruments TMS320C6x, partition the register bank into multiple banks, each directly connected to only a subset of the functional units. These functional unit/register bank groups are called clusters. Clustered architectures require that either copy operations or delay slots be inserted when an operation accesses data stored on a different cluster. To generate excellent code for such architectures, the compiler must not only spread the computation across clusters to achieve maximum parallelism, but also limit the effects of intercluster data transfers. Loop unrolling and unroll-and-jam enhance the parallelism in loops to help limit the effects of intercluster data transfers. In this paper, we describe an accurate metric for predicting the intercluster communication cost of a loop and present an integer-optimization problem that can be used to guide the application of unroll-and-jam and loop unrolling, considering the effects of both ILP and intercluster data transfers. Our method achieves a harmonic mean speedup of 1.4 to 1.7 on software-pipelined loops for both a simulated architecture and the TI TMS320C64x.
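
For readers unfamiliar with unroll-and-jam, the following is a minimal sketch of the transformation on a matrix-vector product, written in Python purely for illustration. It is not the paper's metric or optimization problem; the function names and the fixed unroll factor of 2 are assumptions chosen for the example. The outer loop is unrolled by two and the resulting inner-loop copies are fused ("jammed"), so each inner iteration carries two independent multiply-adds that a compiler could place on different clusters while sharing the loaded value `x[j]`.

```python
def matvec(a, x):
    """Reference loop nest: y[i] = sum over j of a[i][j] * x[j]."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def matvec_unroll_and_jam(a, x):
    """Same computation with the outer loop unrolled by 2 and the
    two inner-loop copies jammed into a single inner loop."""
    n, m = len(a), len(x)
    y = [0.0] * n
    i = 0
    # Main loop: handles rows i and i + 1 together.
    while i + 1 < n:
        for j in range(m):
            xj = x[j]                      # loaded once, used by both rows
            y[i] += a[i][j] * xj           # independent of the next line
            y[i + 1] += a[i + 1][j] * xj
        i += 2
    # Cleanup loop: handles the last row when n is odd.
    while i < n:
        for j in range(m):
            y[i] += a[i][j] * x[j]
        i += 1
    return y


a = [[1, 2], [3, 4], [5, 6]]
x = [1, 1]
result_plain = matvec(a, x)
result_jammed = matvec_unroll_and_jam(a, x)
```

Both functions compute the same result; the jammed version only changes the shape of the inner loop body. Choosing the unroll factor is the decision that the abstract's integer-optimization problem is meant to guide.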
