Continuous profiling: where have all the cycles gone?

This paper describes the DIGITAL Continuous Profiling Infrastructure, a sampling-based profiling system designed to run continuously on production systems. The system supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel. Samples are collected at a high rate (over 5200 samples/sec per 333-MHz processor), yet with low overhead (1–3% slowdown for most workloads). Analysis tools supplied with the profiling system use the sample data to produce an accurate accounting, down to the level of pipeline stalls incurred by individual instructions, of where time is being spent. When instructions incur stalls, the tools identify possible reasons, such as cache misses, branch mispredictions, and functional unit contention. The fine-grained instruction-level analysis guides users and automated optimizers to the causes of performance problems and provides important insights for fixing them.
