Design of an efficient communication infrastructure for highly contended locks in many-core CMPs

Lock synchronization is a key programming primitive for shared-memory many-core CMPs. However, as the number of cores increases, conventional software implementations cannot deliver the desired levels of performance and scalability. Meanwhile, most existing hardware-supported lock proposals require modifications at some level of the memory hierarchy, so their synchronization traffic can degrade the QoS of applications. In this paper, we propose GLock, which combines a dedicated network infrastructure with a token-based message-passing protocol to provide a non-intrusive, highly efficient and fair implementation of highly contended locks. Two implementations of GLock are considered. The first leverages current full-custom G-lines technology, whilst the second uses a cost-effective mainstream industrial toolflow with an advanced 45 nm technology. Compared with the most efficient software-based lock, both alternatives significantly reduce execution time, network traffic and power consumption for a representative set of benchmarks, with negligible area overhead.
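The abstract does not spell out the GLock protocol, so the following is only an illustrative sketch of the general idea behind a token-based fair lock, not the authors' design. It assumes a single hypothetical arbiter (`TokenLockController`) that owns one token and hands it to requesting cores in round-robin order; all names, the central arbiter, and the round-robin policy are assumptions made for this example.

```python
class TokenLockController:
    """Hypothetical arbiter: owns one token and grants it round-robin."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.requesting = [False] * num_cores  # pending request flag per core
        self.holder = None                     # core currently holding the token
        self.next_core = 0                     # where the round-robin scan starts

    def request(self, core):
        # A core signals that it wants the lock.
        self.requesting[core] = True

    def release(self, core):
        # Only the current holder may return the token.
        if self.holder != core:
            raise ValueError(f"core {core} does not hold the token")
        self.holder = None

    def step(self):
        """Grant the free token to the next requester; return that core or None."""
        if self.holder is not None:
            return None
        for offset in range(self.num_cores):
            core = (self.next_core + offset) % self.num_cores
            if self.requesting[core]:
                self.requesting[core] = False
                self.holder = core
                # Start the next scan after this core so it cannot monopolize the lock.
                self.next_core = (core + 1) % self.num_cores
                return core
        return None


def simulate(num_cores, request_order):
    """Issue all requests up front, then record the order in which cores are served."""
    ctrl = TokenLockController(num_cores)
    for core in request_order:
        ctrl.request(core)
    grants = []
    while True:
        granted = ctrl.step()
        if granted is None:
            break
        grants.append(granted)  # the critical section would run here
        ctrl.release(granted)
    return grants
```

In this sketch the grant order depends on the round-robin pointer rather than on request arrival order, which is one simple way to obtain the kind of fairness the abstract mentions. A core that releases and immediately re-requests is served only after the other waiting cores have had a turn.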
