Paper information - Evaluating The Performance of Non-Blocking Synchronisation on Modern Shared-Memory Multiprocessors

Parallel programs running on shared-memory multiprocessors coordinate via shared data objects and structures. To ensure the consistency of these shared data structures, programs typically rely on some form of software synchronisation. Unfortunately, typical software synchronisation mechanisms usually result in poor performance because they produce large amounts of memory and interconnection network contention and, more significantly, because they produce convoy effects that degrade significantly in multiprogramming environments: if one process holding a lock is preempted, other processes on different processors waiting for the lock will not be able to proceed. Researchers have introduced non-blocking synchronisation to address the above problems. However, its performance implications are not well understood on modern systems or on real applications. In this paper we study the impact of non-blocking synchronisation on parallel applications running on top of a modern, 64-processor, cache-coherent, shared-memory multiprocessor system: the SGI Origin 2000. In addition to the performance results on a modern system, we investigate the key synchronisation schemes that are used in multiprocessor applications and their efficient transformation to non-blocking ones.

Yi Zhang | Philippas Tsigas
