Memory Referencing Behavior in Compiler-Parallelized Applications

Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) affect memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from the SPEC, NAS, PERFECT, and RICEPS benchmark suites, we measure speedups, memory costs, causes of cache misses, cache line utilization, and data traffic. This exploration allows us to draw several conclusions. First, we find that larger-granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512-byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharing or other types of poor memory system performance, and we suggest methods for improving them. Overall, this study offers both an important snapshot of the behavior of applications compiled by state-of-the-art compilers and an increased understanding of the interplay between cache line size, program granularity, and memory performance in moderate-scale multiprocessors.
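The interaction between cache line size and parallel granularity noted in the abstract can be illustrated with a small counting model. The sketch below is not the paper's methodology: the function name, its parameters, and the 8-byte element size are assumptions made for illustration. It counts how many cache lines of a one-dimensional array are written by more than one processor (the lines that are candidates for false sharing) when loop iterations are assigned in contiguous blocks versus round-robin.

```python
def shared_lines(n_elems, n_procs, line_bytes, elem_bytes=8, schedule="block"):
    """Count cache lines written by more than one processor.

    Hypothetical model: iteration i writes array element i, and the
    array starts on a cache line boundary.
    """
    per_line = line_bytes // elem_bytes      # elements per cache line
    chunk = -(-n_elems // n_procs)           # ceiling division: block size
    writers = {}                             # line index -> set of processors
    for i in range(n_elems):
        if schedule == "block":
            proc = i // chunk                # coarse-grain: contiguous blocks
        else:
            proc = i % n_procs               # fine-grain: round-robin
        writers.setdefault(i // per_line, set()).add(proc)
    return sum(1 for procs in writers.values() if len(procs) > 1)


if __name__ == "__main__":
    for sched in ("block", "cyclic"):
        count = shared_lines(1000, 8, 512, schedule=sched)
        print(sched, count)
```

In this model, with 1000 elements, 8 processors, and 512-byte lines, the block schedule shares only the lines that straddle block boundaries, while the cyclic schedule shares every line. That is consistent with the abstract's observation that coarser granularity tends to go with better memory behavior, though a real memory system involves many effects this count ignores.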
