High-Performance Distributed RMA Locks

We propose a topology-aware distributed Reader-Writer lock that accelerates irregular workloads on supercomputers and in data centers. The lock has a modular design built from three interacting distributed data structures: a counter of readers and writers in the critical section, a set of queues that order writers waiting for the lock, and a tree that binds the queues together and synchronizes writers with readers. Each structure has a parameter that favors either readers or writers, which makes performance adjustable; any configuration can be viewed as a point in a three-dimensional parameter space. We also develop a distributed topology-aware MCS lock that serves as a building block of this design and improves on state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for highest performance and scalability. We evaluate our schemes on a Cray XC30 and show that they outperform state-of-the-art MPI-3 RMA locking protocols by 81% and 73%, respectively. Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.
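
For readers unfamiliar with the MCS lock named above, the following is a minimal shared-memory sketch of the classic MCS queue lock, written with C++ atomics. It is not the distributed, topology-aware variant from this work, and it is not the authors' code: the names `McsNode`, `McsLock`, and `run_counter_demo` are hypothetical and chosen only for illustration. In a distributed RMA setting, the atomic swap and compare-and-swap on the queue tail would presumably be issued as remote atomics (MPI-3 offers `MPI_Fetch_and_op` and `MPI_Compare_and_swap`), but that mapping is an assumption here rather than something the abstract specifies.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// One queue node per waiting thread (hypothetical name). Each thread spins
// only on its own node, which is the property that makes MCS locks scale.
struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

// Classic MCS queue lock; sequentially consistent atomics keep the sketch simple.
class McsLock {
public:
    void lock(McsNode& me) {
        me.next.store(nullptr);
        me.locked.store(true);
        // Atomically append our node to the tail of the waiter queue.
        McsNode* pred = tail_.exchange(&me);
        if (pred != nullptr) {
            // Link behind the predecessor, then wait until it hands over the lock.
            pred->next.store(&me);
            while (me.locked.load()) {
                std::this_thread::yield();
            }
        }
    }

    void unlock(McsNode& me) {
        McsNode* succ = me.next.load();
        if (succ == nullptr) {
            // No visible successor: try to reset the queue to empty.
            McsNode* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr)) {
                return;
            }
            // A successor swapped itself in but has not linked yet; wait for the link.
            while ((succ = me.next.load()) == nullptr) {
                std::this_thread::yield();
            }
        }
        // Pass the lock directly to the next waiter in FIFO order.
        succ->locked.store(false);
    }

private:
    std::atomic<McsNode*> tail_{nullptr};
};

// Demo helper (hypothetical): several threads increment a plain counter
// under the lock; the result is exact only if mutual exclusion holds.
long run_counter_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    McsLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            McsNode node;  // reused across acquisitions by this thread
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```

Calling `run_counter_demo(4, 10000)` runs four threads that each acquire the lock 10000 times. The sketch illustrates why an MCS-style queue is a natural building block for ordering waiting writers: each waiter spins on a location it owns, and the lock is handed over with a single write to the successor's node.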
