Resource widening versus replication: limits and performance-cost trade-off

A balanced increase of memory bandwidth and computational capabilities is going to be one of the trends in the design of near-future high-performance microprocessors. Alternative solutions are foreseen for the organization of their resources, mainly based on different degrees of resource replication and/or adaptation of resources to the operations most frequently found in highly performance-demanding applications. For instance, doubling the width of the buses between the register file and the first-level data cache is an example of a design that attains performance similar to doubling the number of buses in numerical applications. In this paper we evaluate the cost/performance trade-off of a wide set of design alternatives oriented towards having high memory bandwidth and computing capabilities in future architectures. Different degrees of replication and widening are the basis for the generation of the design space. Performance evaluation is based on the results obtained for a large number of inner loops in the Perfect Club benchmarks. Implementation costs for the register file and functional units are estimated for different foreseen integration technologies, which allows us to analyse their future availability. The results show that replication is the most effective in terms of performance but results in an unaffordable cost, while widening has a much smaller cost but yields less performance. Combining a small degree of widening and replication results in the best performance/cost ratio.
