Paper information - Evaluation of Bus Based Interconnect Mechanisms in Clustered VLIW Architectures


With new, sophisticated compiler technology, it is possible to schedule distant instructions efficiently. As a consequence, the amount of exploitable instruction-level parallelism (ILP) in applications has gone up considerably. However, VLIW architectures with a monolithic register file (RF) scale poorly, because the centralized RF is far slower than the functional units (FUs). Clustered VLIW architectures, in which only a subset of the FUs is connected to any given RF, provide an attractive solution to this issue. Recent studies covering a wide variety of inter-cluster interconnection mechanisms have reported substantial gains in performance (measured in number of cycles) over the most studied RF-to-RF type interconnections. However, these studies have compared only one or two design points in the RF-to-RF interconnect design space. In this paper, we extend the previously reported work. We consider both multi-cycle and pipelined buses. To obtain realistic bus latencies, we synthesized the various architectures and calculated post-layout clock periods. The results demonstrate that while there is less than 10% variation in interconnect area, the bus-based architectures are slower by as much as 400%. Also, neither multi-cycle buses, nor pipelined buses, nor an increase in the number of buses achieves performance comparable to point-to-point type interconnects.

Preeti Ranjan Panda | Anshul Kumar | M. Balakrishnan | Anup Gangwar
