Paper information - Inherently workload-balanced clustered microarchitecture

Inherently workload-balanced clustered microarchitecture

The performance of clustered microarchitectures relies on steering schemes that try to find the best trade-off between workload balance and inter-cluster communication penalties. In previously proposed clustered processors, reducing communication penalties and balancing the workload are opposing goals, since improving one usually degrades the other. In this paper we propose a new clustered microarchitecture that can minimize communication penalties without compromising workload balance. The key idea is to arrange the clusters in a ring topology so that the results of one cluster can be forwarded to the neighboring cluster with very short latency. Communication penalties are therefore minimized when the producer of a value and its consumer are placed in adjacent clusters, a placement that also favors workload balance. The proposed microarchitecture is shown to outperform a state-of-the-art clustered processor. For instance, for an 8-cluster configuration with just one fully pipelined unidirectional bus, a 15% average speedup is achieved for FP programs.
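The ring idea in the abstract can be illustrated with a toy sketch. This is not the steering scheme from the paper: the function names, the 16-instruction dependence chain, and the rule "place each consumer in the cluster next to its producer" are assumptions made only for illustration. It shows why adjacent placement tends to spread a dependence chain evenly around the ring.

```python
from collections import Counter


def next_cluster(cluster: int, num_clusters: int) -> int:
    """Neighbor that receives results from `cluster` in a unidirectional ring."""
    return (cluster + 1) % num_clusters


def ring_distance(producer: int, consumer: int, num_clusters: int) -> int:
    """Number of hops from producer to consumer, following the ring direction."""
    return (consumer - producer) % num_clusters


def steer_chain(chain_length: int, num_clusters: int, start: int = 0) -> list:
    """Hypothetical steering rule: each instruction in a dependence chain
    is placed in the cluster adjacent to the one holding its producer."""
    placement = [start]
    for _ in range(chain_length - 1):
        placement.append(next_cluster(placement[-1], num_clusters))
    return placement


# Assumed example: a chain of 16 dependent instructions on 8 clusters.
NUM_CLUSTERS = 8
placement = steer_chain(chain_length=16, num_clusters=NUM_CLUSTERS)

# Instructions per cluster, and hop count for each producer/consumer pair.
load = Counter(placement)
hops = [
    ring_distance(p, c, NUM_CLUSTERS)
    for p, c in zip(placement, placement[1:])
]
```

In this toy setting every producer/consumer pair is one hop apart and the chain wraps around the ring, so no cluster accumulates more work than another. Real workloads have many interleaved chains and other constraints, which this sketch does not model.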

Jaume Abella | Antonio González

[1] James E. Smith, et al. Complexity-Effective Superscalar Processors, 1997, ISCA.

[2] André Seznec, et al. Register write specialization register read specialization: a path to complexity-effective wide-issue superscalar processors, 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings.

[3] Vikas Agarwal, et al. Clock rate versus IPC: the end of the road for conventional microarchitectures, 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[4] Manoj Franklin, et al. The multiscalar architecture, 1993.

[5] José Duato, et al. Efficient interconnects for clustered microarchitectures, 2002, Proceedings. International Conference on Parallel Architectures and Compilation Techniques.

[6] S. Vajapeyam, et al. Improving Superscalar Instruction Dispatch And Issue By Exploiting Dynamic Code Sequences, 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[7] Ramon Canal, et al. Dynamic cluster assignment mechanisms, 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[8] Ken Mai, et al. The future of wires, 2001, Proc. IEEE.

[9] Victor V. Zyuban, et al. Inherently Lower-Power High-Performance Superscalar Architectures, 2001, IEEE Trans. Computers.

[10] Peter M. Kogge, et al. Inherently Lower-Power High-Performance, 2001.

[11] Quinn Jacobson, et al. Trace processors, 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[12] André Seznec, et al. Register write specialization register read specialization: a path to complexity-effective wide-issue superscalar processors, 2002, MICRO 35.

[13] Antonio María González Colás, et al. Reducing wire delay penalty through value prediction, 2000, MICRO 2000.

[14] Manoj Franklin, et al. An empirical study of decentralized ILP execution models, 1998, ASPLOS VIII.

[15] R. Nagarajan, et al. A design space evaluation of grid processor architectures, 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[16] Todd M. Austin, et al. The SimpleScalar tool set, version 2.0, 1997, CARN.

[17] Norman P. Jouppi, et al. The multicluster architecture: reducing cycle time through partitioning, 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[18] Shashank Gupta, et al. Technology Independent Area and Delay Estimations for Microprocessor Building Blocks, 2001.

[19] Ramon Canal, et al. A cost-effective clustered architecture, 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).