Evaluation of Bus Based Interconnect Mechanisms in Clustered VLIW Architectures

With new sophisticated compiler technology, it is possible to schedule distant instructions efficiently. As a consequence, the amount of exploitable instruction-level parallelism (ILP) in applications has gone up considerably. However, VLIW architectures with a monolithic register file (RF) scale poorly, because the centralized RF is far slower than the functional units (FUs). Clustered VLIW architectures, in which each RF is connected to only a subset of the FUs, provide an attractive solution to this issue. Recent studies covering a wide variety of inter-cluster interconnection mechanisms have reported substantial gains in performance (number of cycles) over the most studied RF-to-RF type interconnections. However, these studies have compared only one or two design points in the RF-to-RF interconnect design space. In this paper, we extend the previously reported work. We consider both multi-cycle and pipelined buses. To obtain realistic bus latencies, we synthesized the various architectures and calculated post-layout clock periods. The results demonstrate that while there is less than 10% variation in interconnect area, the bus-based architectures are slower by as much as 400%. Also, neither multi-cycle buses, nor pipelined buses, nor an increase in the number of buses achieves performance comparable to point-to-point type interconnects.
