Inherently workload-balanced clustered microarchitecture

The performance of clustered microarchitectures relies on steering schemes that try to find the best trade-off between workload balance and inter-cluster communication penalties. In previously proposed clustered processors, reducing communication penalties and balancing the workload are opposing goals, since improving one usually degrades the other. In this paper we propose a new clustered microarchitecture that can minimize communication penalties without compromising workload balance. The key idea is to arrange the clusters in a ring topology so that the results of one cluster can be forwarded to the neighboring cluster with a very short latency. Communication penalties are therefore minimized when the producer of a value and its consumer are placed in adjacent clusters, a placement that also favors workload balance. The proposed microarchitecture is shown to outperform a state-of-the-art clustered processor. For instance, for an 8-cluster configuration with just one fully pipelined unidirectional bus, a 15% speedup is achieved on average for FP programs.
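As a rough illustration of why adjacent placement on a ring also spreads work, the sketch below models only hop counts on a unidirectional ring. It is not the steering scheme evaluated in the paper: the function names `ring_hops` and `steer_chain`, the rule "place each consumer in the cluster after its producer", and the single dependence chain are all assumptions made for this example.

```python
from collections import Counter
from typing import List


def ring_hops(producer: int, consumer: int, num_clusters: int) -> int:
    """Hops a value travels from producer to consumer on a unidirectional ring.

    Hypothetical helper: it counts hops only and says nothing about cycles.
    """
    return (consumer - producer) % num_clusters


def steer_chain(chain_length: int, num_clusters: int, start: int = 0) -> List[int]:
    """Assign each instruction of a dependence chain to the cluster that
    follows its producer's cluster (an assumed, simplified steering rule)."""
    return [(start + i) % num_clusters for i in range(chain_length)]


# A chain of 16 dependent instructions steered across 8 clusters.
placement = steer_chain(chain_length=16, num_clusters=8)

# Every producer/consumer pair sits in adjacent clusters (one hop apart).
hops = [ring_hops(p, c, 8) for p, c in zip(placement, placement[1:])]

# Instructions per cluster: the chain wraps around the ring evenly.
load = Counter(placement)
```

Under these assumptions every value travels exactly one hop, and each of the 8 clusters receives 2 of the 16 instructions, so the placement that keeps communication short also keeps the load even. Real code has many interleaved chains and operands from several producers, which this sketch does not model.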
