Widening resources: a cost-effective technique for aggressive ILP architectures

The inherent instruction-level parallelism (ILP) of current applications (especially those based on floating-point computations) has driven hardware designers and compiler writers to investigate aggressive techniques for exploiting program parallelism at the lowest level. To execute more operations per cycle, many processors are designed with growing degrees of resource replication (buses and functional units). However, the high cost of this technique in terms of area and cycle time precludes the use of high degrees of replication. An alternative to resource replication is resource widening, in which the width of the resources is increased, and which has also been used in some recent designs. In this paper we evaluate a broad set of design alternatives that combine replication and widening. For each alternative we estimate the ILP limits (including the impact of spill code for several register file configurations) and the cost of the register file in terms of area and access time. We also project technology trends over the next 10 years to foresee which alternatives will be implementable. From this study we conclude that, when cost is taken into account, the best performance is obtained by combining certain degrees of replication and widening in the hardware resources. The results have been obtained from a large number of inner loops from numerical programs scheduled for VLIW architectures.
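To make the replication-versus-widening trade-off concrete, the following is a minimal sketch of a first-order register file cost comparison. It is not the cost model used in the paper: it assumes the common rule of thumb that register file area grows linearly with the number of stored bits and quadratically with the number of ports, and the names `Config`, `WORD_BITS`, and `PORTS_PER_UNIT`, as well as the constants chosen, are hypothetical.

```python
from dataclasses import dataclass

WORD_BITS = 64       # assumed base register width (one floating-point word)
PORTS_PER_UNIT = 3   # assumed: 2 read ports + 1 write port per functional unit


@dataclass(frozen=True)
class Config:
    """A design point: `replication` functional units, each `width` words wide."""
    replication: int
    width: int
    registers: int = 64  # assumed register count, held constant across designs

    def peak_ops_per_cycle(self) -> int:
        # Each unit can complete up to `width` basic operations per cycle.
        return self.replication * self.width

    def ports(self) -> int:
        # Ports scale with the number of units, not with their width.
        return PORTS_PER_UNIT * self.replication

    def relative_area(self) -> int:
        # Rule-of-thumb assumption: area ~ stored bits * ports^2.
        bits = self.registers * WORD_BITS * self.width
        return bits * self.ports() ** 2


def area_ratios(configs):
    """Return each design's area relative to the first design in the list."""
    base = configs[0].relative_area()
    return {(c.replication, c.width): c.relative_area() / base for c in configs}


if __name__ == "__main__":
    designs = [Config(4, 1), Config(2, 2), Config(1, 4)]
    for (rep, wid), ratio in area_ratios(designs).items():
        print(f"replication={rep} width={wid} relative area={ratio:.2f}")
```

Under these assumptions the three designs have the same peak of four operations per cycle, while the estimated register file area shrinks as replication is traded for widening. The sketch ignores access time and how often wide operations can actually be filled by the compiler, so it says nothing about achieved performance, which is the quantity the study evaluates.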
