Complexity/performance tradeoffs with non-blocking loads

Non-blocking loads are an effective technique for tolerating cache-miss latency on data cache references. In this paper, we describe several methods for implementing non-blocking loads and investigate the resulting range of hardware complexity/performance tradeoffs using an object-code translation and instrumentation system. On the SPEC92 benchmarks, we find that for the integer benchmarks a simple hit-under-miss implementation achieves almost all of the available performance improvement at relatively little cost. For most of the numeric benchmarks, however, more expensive implementations are worthwhile. The results also show the importance of a compiler that schedules load instructions for cache misses rather than cache hits in non-blocking systems.
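
To make the blocking versus hit-under-miss distinction concrete, the following is a minimal toy model, not the instrumentation system used in the paper. It assumes one load per cycle, a fixed miss penalty, a memory system that can service outstanding misses independently, and (optimistically) that a loaded value is never needed before it arrives. The function name `stall_cycles`, its parameters, and the example trace are all hypothetical.

```python
def stall_cycles(trace, miss_penalty, max_outstanding):
    """Count processor stall cycles for a trace of loads (toy model).

    trace: sequence of booleans, True = cache hit, one load issued per cycle.
    max_outstanding: 0 models a blocking cache, 1 models hit-under-miss,
    larger values allow several misses in flight at once.
    Assumption: data dependences on loaded values are ignored.
    """
    cycle = 0
    stalls = 0
    in_flight = []  # completion cycles of outstanding misses
    for hit in trace:
        # Drop misses whose data has already returned.
        in_flight = [t for t in in_flight if t > cycle]
        if not hit:
            if max_outstanding == 0:
                # Blocking cache: every miss stalls for the full penalty.
                stalls += miss_penalty
                cycle += miss_penalty
            else:
                if len(in_flight) >= max_outstanding:
                    # No free slot: wait for the oldest miss to complete.
                    wait = min(in_flight) - cycle
                    stalls += wait
                    cycle += wait
                    in_flight = [t for t in in_flight if t > cycle]
                in_flight.append(cycle + miss_penalty)
        cycle += 1
    return stalls


# Hypothetical trace: miss, three hits, two back-to-back misses, one hit.
trace = [False, True, True, True, False, False, True]
print(stall_cycles(trace, 10, 0))  # blocking cache: 30
print(stall_cycles(trace, 10, 1))  # hit-under-miss: 15
print(stall_cycles(trace, 10, 2))  # two outstanding misses: 5
```

In this sketch, hit-under-miss hides the latency of isolated misses behind the hits that follow them, while clustered misses still stall unless more misses may be outstanding at once. Real results depend on data dependences and instruction scheduling, which the model ignores.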

[35]  N. Jouppi,et al.  Complexity/performance tradeoffs with non-blocking loads , 1994, Proceedings of 21 International Symposium on Computer Architecture.
