Access region locality for high-bandwidth processor memory system design

This paper studies an interesting yet less explored behavior of memory access instructions, called access region locality. Unlike traditional temporal and spatial data locality, which focuses on individual memory locations and how accesses to those locations are inter-related, access region locality concerns each static memory instruction and the range of locations it accesses at run time. We consider a program's data, heap, and stack regions in this paper. Our experimental study using a set of SPEC95 benchmark programs shows that most memory reference instructions access a single region at run time. We also show that the access region of a memory instruction can be predicted accurately at run time by examining the instruction's addressing mode and its past access region history. We develop a simple run-time access region predictor that is similar in structure to a branch predictor. We describe and evaluate a superscalar processor with two distinct sets of memory pipelines, driven by the access region predictor. Experimental results indicate that the proposed mechanism is very effective in providing high memory bandwidth to the processor, yielding performance comparable to or better than that of a conventional memory design with a heavily multi-ported data cache, which can lead to much higher hardware complexity.
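To make the predictor idea concrete, the following is a minimal sketch and not the design evaluated in the paper. It assumes a direct-mapped table indexed by the instruction address that remembers the last region each static instruction accessed, plus an addressing-mode shortcut keyed on the base register. The register names (`sp`, `fp`, `gp`), the table size, the default region, and the address boundaries passed to `classify` are all hypothetical choices made for illustration.

```python
from enum import Enum


class Region(Enum):
    DATA = 0
    HEAP = 1
    STACK = 2


# Hypothetical base-register names used as addressing-mode hints (assumption).
STACK_BASE_REGS = {"sp", "fp"}
DATA_BASE_REGS = {"gp"}


class AccessRegionPredictor:
    """PC-indexed table of last-seen regions, organized like a simple branch predictor."""

    def __init__(self, entries=1024, default=Region.DATA):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.table = [default] * entries

    def _index(self, pc):
        # Drop the low two bits (assumes 4-byte instructions), then wrap into the table.
        return (pc >> 2) & self.mask

    def predict(self, pc, base_reg):
        # Addressing-mode hint first: stack- or global-pointer-relative accesses.
        if base_reg in STACK_BASE_REGS:
            return Region.STACK
        if base_reg in DATA_BASE_REGS:
            return Region.DATA
        # Otherwise fall back to this instruction's recorded region history.
        return self.table[self._index(pc)]

    def update(self, pc, actual):
        # Record the region observed once the effective address is known.
        self.table[self._index(pc)] = actual


def classify(addr, heap_start, stack_floor):
    """Map an effective address to a region using hypothetical layout boundaries."""
    if addr >= stack_floor:
        return Region.STACK
    if addr >= heap_start:
        return Region.HEAP
    return Region.DATA
```

In a simulator, `predict` would be consulted when a load or store is dispatched to choose a memory pipeline, and `update` would be called with the result of `classify` after address generation. A one-entry history is the simplest possible choice; a saturating counter per entry, as in common branch predictors, would be a natural variation.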
