Exploiting Load Latency Tolerance in Dynamically Scheduled Processors

This paper provides quantitative measurements of load latency tolerance in a dynamically scheduled processor and presents one cache management technique that exploits this information to improve overall performance. For each memory load operation, we determine the latency that can be tolerated while keeping the number of instructions issued per cycle (IPC) comparable to that of an ideal memory system, one that satisfies all requests in a single cycle. Our measurements reveal that, to produce IPC values within 16% of the ideal memory system, between 50% and 90% of loads need to be satisfied within a single cycle, while up to 50% can take as many as 8 cycles (an artificially imposed upper limit), depending on the benchmark and processor configuration. Load latency tolerance is largely determined by the number of dependent operations and by whether a branch depends on the load. This paper presents an all-hardware approach to obtaining this information and using it to make cache replacement decisions. Simulation results indicate that this technique can improve IPC values by up to 8%.
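To make the idea of tolerance-guided replacement concrete, the following is a minimal software sketch, not the hardware mechanism the paper describes. It assumes a hypothetical rule (`is_tolerant`) that treats a load as latency tolerant when no branch depends on it and it has few dependent operations, and a hypothetical victim-selection function (`choose_victim`) that prefers to evict lines last touched by tolerant loads, falling back to least recently used order. The names, the threshold `max_dependents`, and the per-line `tolerant` bit are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    tag: int          # address tag of the cached block
    last_use: int     # timestamp of the most recent access (for LRU ordering)
    tolerant: bool    # assumed bit: last load to touch the line was latency tolerant


def is_tolerant(num_dependents: int, feeds_branch: bool,
                max_dependents: int = 1) -> bool:
    """Hypothetical classification rule.

    A load is treated as latency tolerant only if no branch depends on it
    and it has at most `max_dependents` dependent operations. The threshold
    is an illustrative assumption, not a value from the paper.
    """
    return (not feeds_branch) and num_dependents <= max_dependents


def choose_victim(lines: List[Line]) -> Line:
    """Pick a line to evict from one cache set.

    Lines marked tolerant are preferred as victims, since a later miss on
    them is assumed to cost less IPC; ties are broken by evicting the
    least recently used line.
    """
    return min(lines, key=lambda line: (not line.tolerant, line.last_use))


if __name__ == "__main__":
    cache_set = [
        Line(tag=1, last_use=5, tolerant=False),
        Line(tag=2, last_use=9, tolerant=True),
        Line(tag=3, last_use=1, tolerant=False),
    ]
    # The tolerant line is chosen even though it was used most recently.
    print(choose_victim(cache_set).tag)
```

In this sketch, plain LRU would evict the line with tag 3, whereas the tolerance-aware rule evicts the line with tag 2 and keeps the lines whose loads are assumed to be on the critical path.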
