Paper information - Impact of intercluster communication mechanisms on ILP in clustered VLIW architectures


VLIW processors have started gaining acceptance in the embedded systems domain. However, VLIW processors that combine a monolithic register file with a large number of functional units (FUs) are not viable: the register file needs a large number of ports to serve the FUs, which makes it expensive and extremely slow. A simple solution is to break the register file into several smaller register files, each connected to a subset of the FUs. Such architectures are termed clustered VLIW processors. In this article, we first build a case for clustered VLIW processors with four or more clusters by showing that the achievable ILP in most media applications, for a VLIW processor with 16 ALUs and 8 LD/ST units, is around 20. We then provide a classification of the intercluster interconnection design space and show that a large part of this design space is currently unexplored. Next, using our performance evaluation methodology, we evaluate a subset of this design space and show that the most commonly used type of interconnection, RF-to-RF, falls short of the achievable performance by a large factor, while certain other types of interconnection can narrow this gap considerably. We also establish that this behavior is heavily application dependent, which emphasizes the importance of application-specific architecture exploration. We further present results on the statistical behavior of these architectures, obtained by varying the number of clusters in our framework from 4 to 16. These results clearly show the advantages of one specific architecture over the others. Finally, based on our results, we propose a new interconnection network that should narrow this performance gap.
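The port-count argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. The per-FU port counts used here (3 per ALU, 2 per LD/ST unit) and the function name `register_file_ports` are assumptions made for illustration, not figures from the article, and the sketch ignores any extra ports needed for intercluster communication.

```python
def register_file_ports(num_alus, num_ldst, clusters=1,
                        ports_per_alu=3, ports_per_ldst=2):
    """Ports needed on each register file when FUs are split evenly.

    ports_per_alu and ports_per_ldst are assumed values (for example,
    2 read + 1 write per ALU), not numbers taken from the article.
    """
    if clusters < 1:
        raise ValueError("clusters must be at least 1")
    if num_alus % clusters or num_ldst % clusters:
        raise ValueError("FUs must divide evenly across clusters")
    alus_per_cluster = num_alus // clusters
    ldst_per_cluster = num_ldst // clusters
    return alus_per_cluster * ports_per_alu + ldst_per_cluster * ports_per_ldst


# The 16 ALU, 8 LD/ST configuration mentioned in the abstract.
for clusters in (1, 4, 8):
    ports = register_file_ports(16, 8, clusters)
    print(f"{clusters} cluster(s): {ports} ports per register file")
```

Under these assumptions the monolithic register file needs 64 ports, while each register file in a four-cluster design needs 16, which is the kind of reduction that motivates clustering.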

Anshul Kumar | M. Balakrishnan | Anup Gangwar
