Latency-Tolerant Virtual Cluster Architecture for VLIW DSP

This paper proposes a virtual cluster architecture that executes multi-cluster VLIW programs on a reduced number of clusters in a time-sharing fashion. Interleaving the sub-VLIWs hides instruction latencies significantly, which gives the proposed virtual cluster three advantages: (1) reduced forwarding complexity in the processor datapath, (2) an improved programming model that enables further code optimizations, and (3) support for composite instructions without any extra functional units. In our experiments with a 4-cluster VLIW DSP, the 28 forwarding paths inside a cluster are completely eliminated, contributing to savings of 21.71% in delay and 17.56% in silicon area. Moreover, owing to its improved programming model, the virtual cluster achieves smaller code sizes and shorter execution times on various DSP kernels.
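
The time-sharing idea can be illustrated with a small scheduling model. This is a minimal sketch under stated assumptions, not the paper's implementation: the round-robin issue order, the function names (`interleaved_schedule`, `issue_gap`, `stall_cycles`), and the definition of result latency as the number of cycles before a result can be read back from the register file are all hypothetical choices made for illustration.

```python
def issue_gap(virtual_clusters, physical_clusters=1):
    """Cycles between consecutive sub-VLIWs of the same virtual cluster.

    Assumption: virtual clusters are issued round-robin on the physical
    clusters, so the gap equals the time-sharing factor.
    """
    if virtual_clusters % physical_clusters != 0:
        raise ValueError("virtual_clusters must be a multiple of physical_clusters")
    return virtual_clusters // physical_clusters


def interleaved_schedule(num_bundles, virtual_clusters, physical_clusters=1):
    """Return (cycle, bundle, virtual_cluster, physical_cluster) issue slots."""
    share = issue_gap(virtual_clusters, physical_clusters)
    slots = []
    for bundle in range(num_bundles):
        for phase in range(share):
            cycle = bundle * share + phase
            for phys in range(physical_clusters):
                virtual = phase * physical_clusters + phys
                slots.append((cycle, bundle, virtual, phys))
    return slots


def stall_cycles(result_latency, virtual_clusters, physical_clusters=1):
    """Stall cycles for a dependent instruction in the next sub-VLIW of the
    same virtual cluster when no forwarding paths exist.

    Assumption: a result produced at cycle c is readable from the register
    file at cycle c + result_latency.
    """
    gap = issue_gap(virtual_clusters, physical_clusters)
    return max(0, result_latency - gap)


if __name__ == "__main__":
    # Four virtual clusters time-shared on one physical cluster.
    for slot in interleaved_schedule(2, 4, 1):
        print(slot)
    # Conventional 4-cluster datapath vs. time-shared single cluster,
    # with an assumed 3-cycle result latency.
    print(stall_cycles(3, 4, 4), stall_cycles(3, 4, 1))
```

In this model, a dependent instruction of the same virtual cluster issues several cycles after its producer, so a result with a latency no longer than the time-sharing factor is already in the register file when needed. That is the intuition behind removing the forwarding paths; the actual pipeline depths and latencies of the evaluated DSP are not given here.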
