MiSAR: Minimalistic synchronization accelerator with resource overflow management

While numerous hardware synchronization mechanisms have been proposed, they either no longer function or suffer great performance loss when their hardware resources are exceeded, or they add significant complexity and cost to handle such resource overflows. Additionally, prior hardware synchronization proposals focus on one type (barrier or lock) of synchronization, so several mechanisms are likely to be needed to support real applications, many of which use locks, barriers, and/or condition variables. This paper proposes MiSAR, a minimalistic synchronization accelerator (MSA) that supports all three commonly used types of synchronization (locks, barriers, and condition variables), and a novel overflow management unit (OMU) that dynamically manages its (very) limited hardware synchronization resources. The OMU allows safe and efficient dynamic transitions between using hardware (MSA) and software synchronization implementations. This allows the MSA's resources to be used only for currently-active synchronization operations, providing significant performance benefits even when the number of synchronization variables used in the program is much larger than the MSA's resources. Because it allows a safe transition between hardware and software synchronization, the OMU also facilitates thread suspend/resume, migration, and other thread-management activities. Finally, the MSA/OMU combination decouples the instruction set support (how the program invokes hardware-supported synchronization) from the actual implementation of the accelerator, allowing different accelerators (or even wholesale removal of the accelerator) in the future without changes to OMU-compatible application or system code. We show that, even with only 2 MSA entries in each tile, the MSA/OMU combination on average performs within 3% of ideal (zero-latency) synchronization, and achieves a speedup of 1.43X over the software (pthreads) implementation.

[1]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[2]  Donald Yeung,et al.  The MIT Alewife machine: architecture and performance , 1995, ISCA '98.

[3]  Constantine D. Polychronopoulos,et al.  Fast barrier synchronization hardware , 1990, Proceedings SUPERCOMPUTING '90.

[4]  William Gropp,et al.  Design and implementation of message-passing services for the Blue Gene/L supercomputer , 2005, IBM J. Res. Dev..

[5]  Norman P. Jouppi,et al.  Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[6]  José L. Abellán,et al.  A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs , 2010, 2010 39th International Conference on Parallel Processing.

[7]  Ralph Grishman,et al.  The NYU Ultracomputer—designing a MIMD, shared-memory parallel machine (Extended Abstract) , 1982, ISCA '82.

[8]  Milos Prvulovic,et al.  TLSync: Support for multiple fast barriers using on-chip transmission lines , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[9]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[10]  Ralph Grishman,et al.  The NYU Ultracomputer—designing a MIMD, shared-memory parallel machine (Extended Abstract) , 1982, ISCA 1982.

[11]  Guang R. Gao,et al.  Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures , 2007, ISCA '07.

[12]  William J. Dally,et al.  Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[13]  John T. Robinson A fast general-purpose hardware synchronization mechanism , 1985, SIGMOD '85.

[14]  Men-Chow Chiang,et al.  Memory system design for bus-based multiprocessors , 1992 .

[15]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[16]  Christian Bienia,et al.  Benchmarking modern multiprocessors , 2011 .

[17]  Mateo Valero,et al.  Architectural Support for Fair Reader-Writer Locking , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[18]  Zhen Fang,et al.  Highly efficient synchronization based on active memory operations , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[19]  James R. Goodman,et al.  Efficient Synchronization: Let Them Eat QOLB , 1997, International Symposium on Computer Architecture.

[20]  Fabrizio Petrini,et al.  Scalable collective communication on the ASCI Q machine , 2003, 11th Symposium on High Performance Interconnects, 2003. Proceedings..

[21]  José L. Abellán,et al.  GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[22]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[23]  Allan Porterfield,et al.  The Tera computer system , 1990, ICS '90.

[24]  Steven L. Scott,et al.  Synchronization and communication in the T3E multiprocessor , 1996, ASPLOS VII.

[25]  W. Daniel Hillis,et al.  The network architecture of the Connection Machine CM-5 (extended abstract) , 1992, SPAA '92.

[26]  Jaehwan Lee,et al.  A system-on-a-chip lock cache with task preemption support , 2001, CASES '01.