Decoupling local variable accesses in a wide-issue superscalar processor

Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add to the hardware complexity significantly. This paper takes an alternative or complementary approach for providing more data bandwidth, called the data-decoupled architecture. The approach, with support from the compiler and/or hardware, partitions the memory stream into two independent streams early in the processor pipeline, and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies the potential of decoupling memory accesses to program's local variables that are allocated on the run-time stack. Using a set of integer and floating-point programs from the SPEC95 benchmark suite, it is shown that local variable accesses constitute a large portion of all the memory references, while their reference space is very small, averaging around 7 words per (static) procedure. To service local variable accesses quickly, two optimizations, fast data forwarding and access combining, are proposed and studied. Some of the important design parameters, such as the cache size, the number of cache ports, and the degree of access combining, are studied based on simulations. The potential performance of the proposed scheme is measured using various configurations, and it is concluded that the scheme can become a viable alternative to building a single multi-ported data cache.

[1]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[2]  Gary S. Tyson,et al.  Improving the accuracy and performance of memory communication through renaming , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[3]  Kenneth M. Wilson,et al.  Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[4]  Michael J. Flynn,et al.  Execution Architecture: The DELtran Experiment , 1983, IEEE Transactions on Computers.

[5]  Kenneth C. Yeager The Mips R10000 superscalar microprocessor , 1996, IEEE Micro.

[6]  David R. Ditzel,et al.  Register allocation for free: The C machine stack cache , 1982, ASPLOS I.

[7]  Quinn Jacobson,et al.  Trace processors , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[8]  Gary S. Tyson,et al.  On high-bandwidth data cache design for multi-issue processors , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[9]  Andreas Moshovos,et al.  Dynamic Speculation and Synchronization of Data Dependences , 1997, ISCA.

[10]  Marc Tremblay,et al.  A three dimensional register file for superscalar processors , 1995, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences.

[11]  Mark D. Hill,et al.  A case for direct-mapped caches , 1988, Computer.

[12]  Yale N. Patt,et al.  One Billion Transistors, One Uniprocessor, One Chip , 1997, Computer.

[13]  Gregory J. Chaitin,et al.  Register allocation and spilling via graph coloring , 2004, SIGP.

[14]  Ruben W. Castelino,et al.  Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor , 1995, Digit. Tech. J..

[15]  Robert G. Wedig,et al.  A performance analysis of automatically managed top of stack buffers , 1987, ISCA '87.

[16]  P. Yew,et al.  Decoupling local variable accesses in a wide-issue superscalar processor , 1999, Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367).

[17]  Eric Rotenberg,et al.  Trace cache: a low latency approach to high bandwidth instruction fetching , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[18]  Yale N. Patt,et al.  Increasing the instruction fetch rate via multiple branch prediction and a branch address cache , 1993, ICS '93.

[19]  Douglas W. Clark,et al.  A Characterization of Processor Performance in the vax-11/780 , 1984, ISCA '84.

[20]  James E. Smith,et al.  Complexity-Effective Superscalar Processors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[21]  Carlo H. Séquin,et al.  A VLSI RISC , 1982, Computer.

[22]  Matthew T. O'Keefe,et al.  Spill code minimization via interference region spilling , 1997, PLDI '97.

[23]  BurgerDoug,et al.  The SimpleScalar tool set, version 2.0 , 1997 .

[24]  Sangyeun Cho,et al.  Access region locality for high-bandwidth processor memory system design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[25]  FranklinManoj,et al.  High-bandwidth data memory systems for superscalar processors , 1991 .

[26]  Carlo H. Séquin,et al.  Strategies for Managing the Register File in RISC , 1983, IEEE Transactions on Computers.

[27]  S SohiGurindar Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers , 1990 .

[28]  John L. Hennessy,et al.  The priority-based coloring approach to register allocation , 1990, TOPL.

[29]  Gurindar S. Sohi,et al.  High-bandwidth data memory systems for superscalar processors , 1991, ASPLOS IV.

[30]  Mikko H. Lipasti,et al.  Superspeculative Microarchitecture for Beyond AD 2000 , 1997, Computer.

[31]  Andreas Moshovos,et al.  Streamlining inter-operation memory communication via data dependence prediction , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[32]  G.S. Sohi,et al.  Dynamic Speculation And Synchronization Of Data Dependence , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[33]  Kenneth M. Wilson,et al.  Designing High Bandwidth On-chip Caches , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[34]  Thorsten von Eicken,et al.  技術解説 IEEE Computer , 1999 .