Effectiveness of trace sampling for performance debugging tools

Recently there has been a surge of interest in developing performance debugging tools to help programmers tune their applications for better memory performance [2, 4, 10]. These tools vary both in the detail of feedback provided to the user, and in the run-time overbead of using them. MemSpy [10] is a simulation-based tool which gives programmers detailed statistics on the memory system behavior of applications. It provides information on the frequency and causes of cache misses, and presents it in terms of source-level data and code objects with which the programmer is familiar. However, using MemSpy increases a program's execution time by roughly 10 to 40 fold. This overhead is generally acceptable for applications with execution times of several minutes or less, but it can be inconvenient when tuning applications with very long execution times.This paper examines the use of trace sampling techniques to reduce the execution time overhead of tools like MemSpy. When simulating one tenth of the references, we find that MemSpy's execution time overhead is improved by a factor of 4 to 6. That is, the execution time when using MemSpy is generally within a factor of 3 to 8 times the normal exwution time. With this improved performance, we observe only small errors in the performance statistics reported by MemSpy. On moderate sized caches of 16KB to 128KB, simulating as few as one tenth of the references (in samples of 0.5M references each) allows us to estimate the program's actual cache miss rate with an absolute error no greater than 0.3% on our five benchmarks. These errors are quite tolerable within the context of performance bugging. With larger caches we can also obtain good accuracy by using longer sample lengths. We conclude that, used with care, trace sampling is a powerful technique that makes possible performance debugging tools which provide both detailed memory statistics and low execution time overheads.

[1]  Margaret Martonosi,et al.  MemSpy: analyzing memory system bottlenecks in programs , 1992, SIGMETRICS '92/PERFORMANCE '92.

[2]  James Arthur Kohl,et al.  A Tool to Aid in the Design, Implementation, and Understanding of Matrix Algorithms for Parallel Processors , 1990, J. Parallel Distributed Comput..

[3]  Harold Stuart Stone High-performance computer architecture (2nd ed.) , 1990 .

[4]  Anoop Gupta,et al.  Parallel ICCG on a hierarchical memory multiprocessor - Addressing the triangular solve bottleneck , 1990, Parallel Comput..

[5]  Thomas E. Anderson,et al.  Quartz: a tool for tuning parallel program performance , 1990, SIGMETRICS '90.

[6]  Norman P. Jouppi,et al.  Computer technology and architecture: an evolving interaction , 1991, Computer.

[7]  Janak H. Patel,et al.  Accurate Low-Cost Methods for Performance Evaluation of Cache Memory Systems , 1988, IEEE Trans. Computers.

[8]  David A. Wood,et al.  A Comparison of Trace-Sampling Techniques for Multi-Megabyte Caches , 1994, IEEE Trans. Computers.

[9]  Susan L. Graham,et al.  An execution profiler for modular programs , 1983, Softw. Pract. Exp..

[10]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[11]  Laxmi N. Bhuyan,et al.  High-performance computer architecture , 1995, Future Gener. Comput. Syst..

[12]  John L. Hennessy,et al.  Performance debugging shared memory multiprocessor programs with MTOOL , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[13]  John L. Hennessy,et al.  MTOOL: A Method for Isolating Memory Bottlenecks in Shared Memory Multiprocessor Programs , 1991, ICPP.

[14]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[15]  David A. Wood,et al.  A model for estimating trace-sample miss ratios , 1991, SIGMETRICS '91.