Efficient Hardware-Supported Synchronization Mechanisms for Manycores

In this chapter, we analyze the synchronization problem at the server (manycore processor) level in datacenters and propose techniques to mitigate it. In particular, we propose two strategies that provide efficient, scalable, and lightweight hardware implementations of barriers and highly contended locks. We implement our synchronization architectures using two different technologies. The first is a state-of-the-art full-custom technology, namely G-Lines; the second, which we refer to as Standard technology, is a cost-effective mainstream industrial toolflow with an advanced 45 nm technology.
