Microbenchmarking and Performance Prediction for Parallel Computers

Previous research on this project (by Saavedra and Smith) evaluated the performance of sequential computers. That work presented (a) measurements of machines at the level of source-language primitive operations; (b) analysis of standard benchmarks; (c) prediction of run times based on separate measurements of the machines and the programs; (d) analysis of the effectiveness of compiler optimizations; and (e) measurements of the performance and design of cache memories. In this paper, we extend the earlier work to parallel computers. We describe a portable benchmarking suite and a performance prediction methodology that accurately predicts the run times of Fortran 90 programs running on supercomputers. The benchmarking suite measures the optimization capabilities of a given Fortran 90 compiler, the execution rates of abstract Fortran 90 operations, and the processing characteristics of the underlying architecture as exposed by compiler-generated code. To predict the run time of an arbitrary program, we combine our benchmark results with dynamic execution measurements of that program, and augment the resulting prediction with simple factors that account for overhead due to architecture-specific effects, such as remote reference latencies. We measure two supercomputers: a dedicated 128-node TMC CM-5, a distributed memory multiprocessor, and a 4-node partition of a Cray YMP-C90, a tightly integrated shared memory multiprocessor. Our measurements show that the performance of the YMP-C90 far outstrips that of the CM-5, owing to both the quality of the available compilers and the architectural characteristics of each machine. To validate our prediction methodology, we predict the run times of five interesting kernels on these machines; nearly all of the predicted run times are within 50 percent of the actual run times, much closer than might be expected.
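The prediction step described above can be illustrated with a minimal sketch. It assumes that the predicted run time is the sum, over abstract operations, of the dynamic execution count multiplied by the benchmarked time per operation, and that the architecture-specific overhead is applied as a single multiplicative factor. The abstract does not state the exact form of the overhead factors, so that choice is an assumption. The function name, the operation names, and all numeric values are hypothetical and are not measurements from the paper.

```python
def predict_run_time(op_times, op_counts, overhead_factor=1.0):
    """Estimate run time in seconds from per-operation timings and counts.

    op_times:  seconds per abstract operation, as a benchmark suite might report
    op_counts: dynamic execution count of each operation in the target program
    overhead_factor: assumed multiplicative correction for architecture-specific
        effects such as remote reference latency (hypothetical form)
    """
    missing = set(op_counts) - set(op_times)
    if missing:
        raise KeyError(f"no benchmark measurement for: {sorted(missing)}")
    # Base prediction: every executed operation costs its benchmarked time.
    base = sum(op_counts[op] * op_times[op] for op in op_counts)
    return base * overhead_factor


# Hypothetical machine characterization (seconds per operation).
op_times = {"real_add": 2.0e-8, "real_mul": 2.5e-8, "array_ref": 4.0e-8}

# Hypothetical dynamic operation counts for one kernel.
op_counts = {"real_add": 1_000_000, "real_mul": 1_000_000, "array_ref": 2_000_000}

predicted = predict_run_time(op_times, op_counts, overhead_factor=1.5)
print(f"predicted run time: {predicted:.4f} s")
```

Keeping the machine timings and the program counts in separate tables mirrors the separation of machine measurements from program measurements described in the abstract: the same counts can be combined with the timings of a different machine without re-profiling the program.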
This work was supported principally by NASA under Grant NCC 2-550, and in part by the National Science Foundation under grants MIP-9116578 and CCR-9117028, by the State of California under the MICRO program, and by Intel Corporation, Apple Computer Corporation, Sun Microsystems, Digital Equipment Corporation, Philips Laboratories/Signetics, International Business Machines Corporation, and Mitsubishi Electric Research Laboratories.
