Reducing Compulsory and Capacity Misses

This paper investigates several methods for reducing cache miss rates. Longer cache lines can decrease miss rates when used in conjunction with miss caches, and prefetch techniques can reduce miss rates as well. Stream buffers, however, outperform both approaches: they are shown to achieve lower miss rates than the optimal line size chosen for each program, and they perform better than or nearly as well as traditional prefetch techniques even when single instruction-issue latency is assumed for prefetches. Stream buffers combined with victim caches can often reduce the miss rate by as much as doubling or quadrupling the cache size would. In some cases the reduction in miss rate provided by stream buffers and victim caches is larger than that obtained by a cache of any size. Finally, the potential for compiler optimizations to increase the performance of stream buffers is investigated.
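To make the stream-buffer idea concrete, the following is a minimal sketch and not the paper's simulator. It assumes a direct-mapped cache backed by a single FIFO stream buffer that prefetches sequential lines after a miss, and it ignores prefetch latency entirely. The function name `count_misses` and the parameters `num_lines`, `line_size`, and `stream_depth` are hypothetical, chosen only for illustration.

```python
from collections import deque


def count_misses(addresses, num_lines=4, line_size=16, stream_depth=0):
    """Count demand misses that go to the next memory level.

    Assumed model (illustrative only): a direct-mapped cache of `num_lines`
    lines, optionally backed by one FIFO stream buffer that holds up to
    `stream_depth` sequentially prefetched line numbers.
    """
    cache = [None] * num_lines  # cache[index] = line number stored there
    stream = deque()            # prefetched line numbers, oldest first
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_lines
        if cache[index] == line:
            continue  # cache hit
        if stream and stream[0] == line:
            # Hit at the head of the stream buffer: move the line into the
            # cache and prefetch one more sequential line to refill the FIFO.
            stream.popleft()
            next_line = stream[-1] + 1 if stream else line + 1
            stream.append(next_line)
        else:
            # Miss in both structures: fetch the line on demand and restart
            # the stream at the lines that follow it.
            misses += 1
            stream = deque(range(line + 1, line + 1 + stream_depth))
        cache[index] = line
    return misses


# Sequential sweep over 64 lines of 16 bytes, one access every 4 bytes.
sweep = list(range(0, 64 * 16, 4))
baseline = count_misses(sweep, stream_depth=0)
with_stream = count_misses(sweep, stream_depth=4)
```

Under these assumptions, a purely sequential sweep misses once per line without the buffer, while the stream buffer absorbs every miss after the first. Real workloads mix sequential and non-sequential references, so the benefit in practice depends on the access pattern and on prefetch latency, which this sketch does not model.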
