Using cause-effect analysis to understand the performance of distributed programs

Abstract

Understanding the performance of distributed programs can be very difficult, since a program's performance depends on characteristics of the application, the underlying hardware, the software environment, and interactions among all three. In this paper we present cause-effect analysis (CEA), a general approach to understanding distributed program performance that facilitates performance analysis, tuning, and prediction. Using detailed program traces gathered at execution time as input, CEA automatically generates explanations for important performance phenomena, identifying code segments that are responsible for the occurrence of the phenomena.

We illustrate our approach by describing CEA techniques for three classes of overheads in distributed programs: contention, synchronization, and communication. Using the explanations produced by CEA, we are able to understand and minimize common performance problems in real applications including load imbalance, false sharing, and resource contention.
