MemSpy: analyzing memory system bottlenecks in programs

To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior—if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task. This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data oriented, in addition to code oriented, performance tuning. Thus, for both source level code objects and data objects, MemSpy provides information such as cache miss rates, causes of cache misses, and in multiprocessors, information on cache invalidations and local versus remote memory misses. MemSpy also introduces a concise matrix presentation to allow programmers to view both code and data oriented statistics at the same time. This paper presents design and implementation issues for MemSpy, and gives a detailed case study using MemSpy to tune a parallel sparse matrix application. It shows how MemSpy helps pinpoint memory system bottlenecks, such as poor spatial locality and interference among data structures, and suggests paths for improvement.

[1]  Anoop Gupta,et al.  Parallel ICCG on a hierarchical memory multiprocessor - Addressing the triangular solve bottleneck , 1990, Parallel Comput..

[2]  James H. Patterson,et al.  Portable Programs for Parallel Processors , 1987 .

[3]  Norman P. Jouppi,et al.  Computer technology and architecture: an evolving interaction , 1991, Computer.

[4]  Larry Rudolph,et al.  PIE: A Programming and Instrumentation Environment for Parallel Processing , 1985, IEEE Software.

[5]  Anoop Gupta,et al.  The DASH prototype: implementation and performance , 1992, ISCA '92.

[6]  Ben Zorn,et al.  A memory allocation profiler for c and lisp , 1988 .

[7]  Helen Davis,et al.  Tango: A Multiprocessor Simulation and Tracing System , 1990 .

[8]  John L. Hennessy,et al.  MTOOL: A Method for Isolating Memory Bottlenecks in Shared Memory Multiprocessor Programs , 1991, ICPP.

[9]  John L. Hennessy,et al.  Performance debugging shared memory multiprocessor programs with MTOOL , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[10]  Ilya Gertner,et al.  Non-intrusive and interactive profiling in parasight , 1988, PPEALS '88.

[11]  E AndersonThomas,et al.  Quartz: a tool for tuning parallel program performance , 1990 .

[12]  James Arthur Kohl,et al.  A Tool to Aid in the Design, Implementation, and Understanding of Matrix Algorithms for Parallel Processors , 1990, J. Parallel Distributed Comput..

[13]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[14]  Anoop Gupta,et al.  Memory-reference characteristics of multiprocessor applications under MACH , 1988, SIGMETRICS '88.

[15]  MartonosiMargaret,et al.  MemSpy: analyzing memory system bottlenecks in programs , 1992 .

[16]  Anoop Gupta,et al.  Memory-reference characteristics of multiprocessor applications under MACH , 1988, SIGMETRICS 1988.

[17]  John L. Hennessy,et al.  Multiprocessor Simulation and Tracing Using Tango , 1991, ICPP.

[18]  Thomas E. Anderson,et al.  Quartz: a tool for tuning parallel program performance , 1990, SIGMETRICS '90.

[19]  Michael E. Wolf,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[20]  Susan L. Graham,et al.  An execution profiler for modular programs , 1983, Softw. Pract. Exp..

[21]  Iain S. Duff,et al.  Sparse matrix test problems , 1982 .

[22]  D LamMonica,et al.  The cache performance and optimizations of blocked algorithms , 1991 .

[23]  Anoop Gupta,et al.  An evaluation of the Chandy-Misra-Bryant algorithm for digital logic simulation , 1991, TOMC.