The effect of sharing on the cache and bus performance of parallel programs

Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared-memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared-memory multiprocessors in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization. Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics proportionally increases with both cache size and block size, and for some cache configurations it determines both their magnitude and their trend. The amount of overhead depends on the pattern of memory references to the shared data: programs that exhibit good per-processor locality perform better than those with fine-grain sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.
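
The contrast between per-processor locality and fine-grain sharing under a write-invalidate protocol can be illustrated with a toy model. The sketch below is not the trace-driven simulator used in the study; it assumes infinite caches (so only cold and sharing misses occur), a round-robin interleaving of processors, and a write-only workload. The names `simulate` and `make_trace` and all parameter values are hypothetical, chosen only for illustration.

```python
def simulate(trace, num_procs, block_words):
    """Replay (proc, addr, is_write) references through write-invalidate caches.

    Assumption: caches are infinite, so every miss is either a cold miss
    or a miss caused by an earlier invalidation (a sharing miss).
    Returns (miss_ratio, invalidation_count).
    """
    caches = [set() for _ in range(num_procs)]  # blocks currently valid per processor
    misses = 0
    invalidations = 0
    for proc, addr, is_write in trace:
        block = addr // block_words
        if block not in caches[proc]:
            misses += 1                # block must be fetched over the bus
            caches[proc].add(block)
        if is_write:
            # Write-invalidate: remove the block from every other cache.
            for other in range(num_procs):
                if other != proc and block in caches[other]:
                    caches[other].discard(block)
                    invalidations += 1
    return misses / len(trace), invalidations


def make_trace(num_procs, words, passes, interleaved):
    """Each processor repeatedly writes its own share of a shared array.

    interleaved=True  -> processor p owns words p, p+P, p+2P, ... (fine-grain sharing)
    interleaved=False -> processor p owns one contiguous chunk (per-processor locality)
    """
    per_proc = words // num_procs
    trace = []
    for _ in range(passes):
        for i in range(per_proc):
            for p in range(num_procs):  # processors take turns, one reference each
                addr = i * num_procs + p if interleaved else p * per_proc + i
                trace.append((p, addr, True))
    return trace


# Hypothetical parameters: 4 processors, 64 shared words, 4-word blocks, 2 passes.
blocked = simulate(make_trace(4, 64, 2, interleaved=False), 4, 4)
fine_grain = simulate(make_trace(4, 64, 2, interleaved=True), 4, 4)
print("per-processor locality:", blocked)
print("fine-grain sharing:    ", fine_grain)
```

With these toy parameters the contiguous layout incurs only cold misses and no invalidations, while in the interleaved layout every block is written by all four processors in turn, so every reference misses. Real programs fall between these extremes, and finite caches add capacity and conflict misses that this sketch ignores.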
