Multiprocessor performance debugging and memory bottlenecks

Driven by the computational demands of scientists and engineers, computer architects are building increasingly complex multiprocessor systems. However, while the peak gigaflop ratings of such systems are often impressive, the actual performance of initial implementations of applications can be disappointing. To make the task of performance debugging manageable, tools are needed that can analyze program behavior and report sources of performance loss. This dissertation describes techniques for building such tools for shared-memory multiprocessors.

Previous efforts to build performance debugging systems for shared-memory multiprocessors had two shortcomings. First, though memory hierarchy performance is often critical to whole-program performance, most tools cannot distinguish the time the CPU spends computing from the time it spends stalled waiting on the memory hierarchy. Second, many tools significantly perturb a program's execution, adding 50% or more overhead, which makes it difficult to measure the behavior of the original uninstrumented code.

This dissertation addresses both of these problems. It describes a software instrumentation system, Mtool, that typically increases program execution time by less than 10% while collecting a detailed profile of where processors are doing work, waiting for work, or stalled waiting on the memory hierarchy. A window-based user interface allows the user to interpret the profile, viewing compute, memory, and synchronization bottlenecks at increasing levels of detail, from the whole-program level down to the level of individual procedures, loops, and synchronization objects. In addition to introducing Mtool, we present extensive data on the overhead of collecting basic block count profiles, provide a characterization of the memory overheads in the SPEC benchmark suite, and offer several case studies to illustrate how the features of Mtool have helped users improve multiprocessor program performance.
