Parallel performance prediction using lost cycles analysis

Most performance debugging and tuning of parallel programs is based on the "measure-modify" approach, which depends heavily on detailed measurements of programs during execution. This approach is extremely time-consuming and does not lend itself to predicting performance under varying conditions. Analytic modeling and scalability analysis provide predictive power, but are not widely used in practice, primarily because they emphasize asymptotic behavior and because accurate models that work for real-world programs are difficult to develop. We describe a set of tools for performance tuning of parallel programs that bridges this gap between measurement and modeling. The approach is based on lost cycles analysis, which involves measurement and modeling of all sources of overhead in a parallel program. We first describe a tool for measuring overheads in parallel programs that we have incorporated into the runtime environment for Fortran programs on the Kendall Square KSR1. We then describe a tool that fits these overhead measurements to analytic forms. We illustrate the use of these tools by analyzing the performance tradeoffs among parallel implementations of 2D FFT. These examples show how our tools enable programmers to develop accurate performance models of parallel applications without requiring extensive performance modeling expertise.
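
The idea of fitting overhead measurements to analytic forms can be sketched as follows. This is a minimal illustration, not the tool described above: the candidate forms, the function names (`fit_coefficient`, `best_form`, `predict_time`), and the sample data are all hypothetical, and the prediction assumes that running time equals pure computation plus total modeled overhead, divided evenly over p processors.

```python
import math

# Hypothetical candidate analytic forms for one overhead category,
# each a function of problem size n and processor count p.
CANDIDATE_FORMS = {
    "c*p": lambda n, p: p,
    "c*n/p": lambda n, p: n / p,
    "c*log2(p)": lambda n, p: math.log2(p),
    "c*n": lambda n, p: n,
}


def fit_coefficient(form, samples):
    """Least-squares fit of y ~ c * form(n, p).

    samples is a list of (n, p, measured_overhead) tuples.
    Returns (c, sum of squared residuals).
    """
    num = sum(y * form(n, p) for n, p, y in samples)
    den = sum(form(n, p) ** 2 for n, p, _ in samples)
    c = num / den if den else 0.0
    sse = sum((y - c * form(n, p)) ** 2 for n, p, y in samples)
    return c, sse


def best_form(samples):
    """Pick the candidate form with the smallest squared error."""
    fits = {name: fit_coefficient(f, samples) for name, f in CANDIDATE_FORMS.items()}
    name = min(fits, key=lambda k: fits[k][1])
    return name, fits[name][0]


def predict_time(n, p, pure_compute, overhead_models):
    """Assumed model: (pure computation + all modeled overhead) / p."""
    lost = sum(c * CANDIDATE_FORMS[name](n, p) for name, c in overhead_models)
    return (pure_compute + lost) / p


# Synthetic measurements (not real data): overhead generated as 3 * log2(p).
samples = [(n, p, 3.0 * math.log2(p)) for n in (64, 128) for p in (2, 4, 8, 16)]
name, coeff = best_form(samples)
predicted = predict_time(64, 4, 1000.0, [(name, coeff)])
```

In practice each overhead category would get its own fitted form, and the sum of the fitted forms would be used to predict running time at problem sizes and processor counts that were never measured.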
