A software-hardware hybrid steering mechanism for clustered microarchitectures

Clustered microarchitectures are a promising way to solve or alleviate the problems of increasing microprocessor complexity and wire delays. High-performance out-of-order processors rely on hardware-only steering mechanisms to balance the workload among clusters. However, the additional steering logic significantly increases complexity, which erodes the benefits of the clustered design. In this paper, we address this complexity issue and present a novel software-hardware hybrid steering mechanism for out-of-order processors. The proposed software-hardware cooperative scheme is built on the concept of virtual clusters. Instructions are assigned to virtual clusters at compile time using static properties of the program, such as data dependences. Then, at runtime, virtual clusters are mapped onto physical clusters using workload information. Experiments using SPEC CPU2000 benchmarks show that our hybrid approach achieves almost the same performance as a state-of-the-art hardware-only steering scheme while requiring little hardware complexity. In addition, the proposed mechanism outperforms state-of-the-art software-only steering mechanisms by 5% and 10% on average for 2-cluster and 4-cluster machines, respectively.
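
The two-phase idea can be illustrated with a minimal sketch. It is not the algorithm evaluated in the paper, whose details the abstract does not give. Everything here is an assumption made for illustration: the function names `assign_virtual_clusters` and `map_to_physical`, the use of union-find over data-dependence edges as the compile-time grouping, and the use of "instructions steered so far" as the runtime workload measure.

```python
def assign_virtual_clusters(num_instructions, dependences, num_virtual):
    """Compile-time step (hypothetical): give dependent instructions the same virtual cluster.

    dependences is a list of (producer, consumer) instruction indices.
    Connected dependence groups are found with union-find, then folded
    into num_virtual virtual clusters in round-robin order.
    """
    parent = list(range(num_instructions))

    def find(x):
        # Find the group representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for producer, consumer in dependences:
        parent[find(consumer)] = find(producer)

    group_to_virtual = {}
    virtual_ids = []
    for i in range(num_instructions):
        root = find(i)
        if root not in group_to_virtual:
            group_to_virtual[root] = len(group_to_virtual) % num_virtual
        virtual_ids.append(group_to_virtual[root])
    return virtual_ids


def map_to_physical(virtual_ids, num_physical):
    """Runtime step (hypothetical): bind each virtual cluster to a physical one.

    A virtual cluster is bound, on first use, to the physical cluster with
    the lowest load. Load is simply the number of instructions steered so
    far; real hardware would also account for instructions leaving the
    cluster, which this sketch ignores.
    """
    load = [0] * num_physical
    mapping = {}
    placement = []
    for v in virtual_ids:
        if v not in mapping:
            mapping[v] = min(range(num_physical), key=lambda p: load[p])
        load[mapping[v]] += 1
        placement.append(mapping[v])
    return placement, mapping


# Six instructions: a chain 0 -> 1 -> 2, a pair 3 -> 4, and an independent 5.
virtual = assign_virtual_clusters(6, [(0, 1), (1, 2), (3, 4)], num_virtual=4)
placement, mapping = map_to_physical(virtual, num_physical=2)
print(virtual, placement)
```

The point of the split is that the expensive dependence analysis happens once in software, while the hardware only needs a small table from virtual to physical cluster identifiers plus a per-cluster load counter.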
