Long Address Traces from RISC Machines: Generation and Analysis

The accurate analysis of cache designs is becoming ever more important as processor speeds rapidly outstrip memory speeds and cache misses become a more significant factor in system performance. It can easily be shown that a factor of 10 decrease in processor cycle time with no change in the memory system may increase execution speed by only a factor of 2, the difference being due entirely to the increased relative cost of servicing cache misses. The primary tool for cache analysis is simulation based on address traces of running systems. The accuracy of the results depends on both the simulation model and the accuracy of the trace data. Existing methods of generating and analyzing traces suffer from a variety of limitations, including complexity, inaccuracy, lack of system references, short length, inflexibility, or applicability only to CISC machines. We have built a system for generating and analyzing traces that addresses each of these problems. Trace generation is based on link-time code modification, which makes the generation of a new trace easy. The slowdown of trace-linked code is small enough to allow reasonably accurate traces of user/system interaction. The system is flexible enough to allow great control over what is traced and when it is traced. On-the-fly analysis removes most limitations on the length of traces. The system is designed for use on RISC machines. In this paper we describe the implementation of trace generation and on-the-fly analysis. We then review preliminary results from the analysis of user traces containing many billions of memory references. Very long traces are both useful and necessary for understanding the behavior of large, fast systems. We experiment with overall size, block size, and associativity in second-level caches. Copyright © 1989 Digital Equipment Corporation
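The factor-of-10 versus factor-of-2 claim can be illustrated with a simple timing model in which the average time per instruction is the sum of processor time and memory stall time. The sketch below is only an illustration: the model and every number in it (cycle times, miss rate, miss penalty) are assumptions chosen so that the arithmetic is easy to follow, not figures taken from the paper.

```python
def time_per_instruction_ns(cycle_ns, cycles_per_instr,
                            misses_per_instr, miss_penalty_ns):
    """Average time per instruction: CPU time plus cache-miss stall time."""
    cpu_time = cycle_ns * cycles_per_instr
    stall_time = misses_per_instr * miss_penalty_ns
    return cpu_time + stall_time


# Hypothetical parameters (assumptions for illustration only).
SLOW_CYCLE_NS = 50.0               # original processor cycle time
FAST_CYCLE_NS = SLOW_CYCLE_NS / 10  # cycle time reduced by a factor of 10
CYCLES_PER_INSTR = 1.0             # one cycle per instruction when all hits
MISSES_PER_INSTR = 0.1             # cache misses per instruction
MISS_PENALTY_NS = 400.0            # memory system unchanged between machines

slow = time_per_instruction_ns(SLOW_CYCLE_NS, CYCLES_PER_INSTR,
                               MISSES_PER_INSTR, MISS_PENALTY_NS)
fast = time_per_instruction_ns(FAST_CYCLE_NS, CYCLES_PER_INSTR,
                               MISSES_PER_INSTR, MISS_PENALTY_NS)
speedup = slow / fast

print(f"slow machine: {slow:.1f} ns per instruction")
print(f"fast machine: {fast:.1f} ns per instruction")
print(f"overall speedup: {speedup:.2f}x for a 10x faster cycle")
```

With these assumed values the stall time per instruction (40 ns) stays fixed while the processor time drops from 50 ns to 5 ns, so the miss cost grows from under half of the total to most of it and the overall speedup is about 2 rather than 10.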
