Paper information - Design of an efficient communication infrastructure for highly contended locks in many-core CMPs

Lock synchronization is a key programming primitive for shared-memory many-core CMPs. However, as the number of cores increases, conventional software implementations cannot meet the desirable levels of performance and scalability. Meanwhile, most existing hardware-supported lock proposals require modifications at some level of the memory hierarchy, thus degrading QoS of applications through synchronization traffic. In this paper, we propose GLock, a dedicated network infrastructure and a token-based message-passing protocol to provide a non-intrusive, extremely efficient and fair implementation for highly contended locks. Two implementations of GLock are considered. The first leverages current full-custom G-lines technology, whilst the second uses a cost-effective mainstream industrial toolflow with an advanced 45 nm technology. When compared with the most efficient software-based lock, both alternatives provide significant reductions in execution time, network traffic and power consumption, for a representative set of benchmarks, with negligible area overhead.
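The abstract does not spell out the protocol, so the following is only a minimal sketch of the general idea behind token-based, fair lock granting; it is not the GLock protocol itself. The class `TokenRingArbiter`, the helper `grant_order`, and the single-ring, round-robin topology are all assumptions introduced here for illustration.

```python
class TokenRingArbiter:
    """Toy model (hypothetical): a single token circulates among cores,
    and only the core that captures the token may hold the lock."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.requesting = [False] * num_cores
        self.token_at = 0    # core the token will visit next
        self.holder = None   # core currently in the critical section

    def request(self, core):
        """Mark a core as waiting for the lock."""
        self.requesting[core] = True

    def release(self, core):
        """Leave the critical section and pass the token to the next core."""
        if self.holder != core:
            raise ValueError("core does not hold the lock")
        self.holder = None
        self.token_at = (core + 1) % self.num_cores

    def step(self):
        """Move the token forward to the first requesting core.

        Returns the granted core, or None if the lock is held
        or nobody is waiting.
        """
        if self.holder is not None:
            return None
        for offset in range(self.num_cores):
            core = (self.token_at + offset) % self.num_cores
            if self.requesting[core]:
                self.requesting[core] = False
                self.holder = core
                self.token_at = core
                return core
        return None


def grant_order(num_cores, requesters):
    """Return the order in which simultaneous requesters obtain the lock."""
    arbiter = TokenRingArbiter(num_cores)
    for core in requesters:
        arbiter.request(core)
    order = []
    while True:
        core = arbiter.step()
        if core is None:
            break
        order.append(core)
        arbiter.release(core)
    return order
```

In this toy model the token always moves past the releasing core before anyone else is served, so a waiting core is overtaken by at most `num_cores - 1` others. That bounded wait is the round-robin notion of fairness the sketch is meant to show; the real design's message formats, wiring, and timing are not modeled.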

José L. Abellán | Manuel E. Acacio | Juan Fernández Peinador
