Impact of intercluster communication mechanisms on ILP in clustered VLIW architectures

VLIW processors are gaining acceptance in the embedded systems domain. However, VLIW processors that pair a monolithic register file with a large number of functional units (FUs) are not viable: the register file needs a large number of ports to serve all the FUs, which makes it expensive and extremely slow. A simple solution is to break the register file into several smaller register files, each connected to a subset of the FUs. Such architectures are termed clustered VLIW processors. In this article, we first build a case for clustered VLIW processors with four or more clusters by showing that the achievable ILP in most media applications for a VLIW processor with 16 ALUs and 8 LD/ST units is around 20. We then classify the intercluster interconnection design space and show that a large part of it is currently unexplored. Next, using our performance evaluation methodology, we evaluate a subset of this design space and show that the most commonly used type of interconnection, RF-to-RF, falls short of the achievable performance by a large factor, while certain other types of interconnection narrow this gap considerably. We also establish that this behavior is heavily application dependent, which emphasizes the importance of application-specific architecture exploration. In addition, we present results on the statistical behavior of these architectures as the number of clusters in our framework varies from 4 to 16. These results clearly show the advantages of one specific architecture over the others. Finally, based on our results, we propose a new interconnection network that should narrow this performance gap.
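
The port-count argument for clustering can be illustrated with a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every functional unit needs three register file ports (two reads and one write) and that FUs are divided evenly among clusters; neither figure comes from the article, and the ports needed for intercluster communication are ignored. The function names are hypothetical.

```python
def register_file_ports(num_fus: int, ports_per_fu: int = 3) -> int:
    """Ports needed by one register file serving num_fus functional units.

    ports_per_fu = 3 (two read ports, one write port) is an illustrative
    assumption, not a figure taken from the article.
    """
    return num_fus * ports_per_fu


def ports_per_cluster(num_fus: int, num_clusters: int, ports_per_fu: int = 3) -> int:
    """Ports on each cluster's register file when FUs are split evenly.

    Intercluster communication ports are deliberately left out of this estimate.
    """
    if num_clusters <= 0 or num_fus % num_clusters != 0:
        raise ValueError("FUs must divide evenly among a positive number of clusters")
    return register_file_ports(num_fus // num_clusters, ports_per_fu)


if __name__ == "__main__":
    total_fus = 16 + 8  # 16 ALUs and 8 LD/ST units, as in the configuration above
    for clusters in (1, 4, 8):
        ports = ports_per_cluster(total_fus, clusters)
        print(f"{clusters} cluster(s): {ports} ports per register file")
```

Under these assumptions the per-file port count falls in proportion to the number of clusters, which is the motivation for clustering; the cost that this estimate omits is the intercluster interconnect the article goes on to evaluate.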
