Performance Prediction and Tuning of Parallel Programs

Parallel programs often behave in unexpected ways due to the complex relationship between the structure of a parallel program, the machine on which it is run, the number of processors used, the program's input, and the measured running time of the program. As a result, performance tuning of parallel programs is an error-prone, time-consuming process.

This dissertation describes a set of tools and methods for assisting the programmer in finding the best-performing implementation for a parallel program, and in answering common questions that arise during the performance tuning process. Our approach is based on three contributions: 1) new metrics for the measurement of parallel applications; 2) a new approach to the analysis of parallel program performance; and 3) a new modelling method that allows the programmer to predict the performance of a program in advance of a complete implementation. The metrics, which we call performance predicates, provide measurements that are amenable to analysis, and yet completely capture parallel overheads. The analysis method, lost cycles analysis, applies algorithmic analysis to parallel overheads, assisted by an on-line tool. The modelling method allows lost cycles analysis to be applied to program fragments, and provides rules for aggregating analytic results into a model for the execution time of a (possibly not-yet-implemented) parallel application. We use implementations of subgraph isomorphism and 2D FFT on the SGI Challenge Series and KSR1 multiprocessors to illustrate our methods and tools, and show how our approach can be used to explain surprising performance results and predict the performance of alternative implementations of an application in advance of implementation, while avoiding large numbers of measurements for performance tuning.
