Memory-System Design Considerations for Dynamically-Scheduled Processors

In this paper, we identify performance trends and design relationships among the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. All systems supporting fewer than four in-flight misses performed similarly, irrespective of register-file size, processor issue width, and memory bandwidth. Supporting more than four in-flight misses did increase system performance, but the improvement was smaller than that obtained by increasing the number of registers. Adding stream buffers to the investigated systems led to a significant performance increase, with larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction-issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters or dynamic-stride calculators, or by the two techniques we introduce: the per-load non-unity stride predictor and incremental prefetching. However, incremental prefetching reduced the bandwidth consumed by stream buffers by 50% without a significant impact on performance.
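
The abstract names a per-load non-unity stride predictor but does not describe how it works. As a rough illustration only, the sketch below shows one generic way to predict strides per load instruction: a table indexed by the load's program counter that proposes a next address once the same non-zero stride has been seen twice in a row. The class names, the two-in-a-row confidence rule, and the unbounded table are assumptions made for this example, not the design evaluated in the paper.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StrideEntry:
    """Per-load state (hypothetical layout, for illustration only)."""
    last_addr: int
    stride: int = 0
    confident: bool = False


class PerLoadStridePredictor:
    """Generic PC-indexed stride table; not the paper's actual mechanism."""

    def __init__(self) -> None:
        # Assumption: unbounded table keyed by load PC. Real hardware would
        # use a small, fixed-size, tagged structure.
        self.table: Dict[int, StrideEntry] = {}

    def observe(self, pc: int, addr: int) -> Optional[int]:
        """Record one executed load; return a predicted next address or None."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this load is seen: nothing to predict yet.
            self.table[pc] = StrideEntry(last_addr=addr)
            return None
        new_stride = addr - entry.last_addr
        # Predict only after the same non-zero stride repeats.
        entry.confident = new_stride == entry.stride and new_stride != 0
        entry.stride = new_stride
        entry.last_addr = addr
        return addr + new_stride if entry.confident else None


predictor = PerLoadStridePredictor()
# A load at PC 0x400 walking an array with a 24-byte (non-unity) stride.
predictions = [predictor.observe(0x400, a) for a in (1000, 1024, 1048, 1072)]
print(predictions)
```

Because each load keeps its own entry, two loads with different strides do not disturb each other's predictions. A predicted address of this kind could be used to start a stream buffer at a non-unit stride.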
