Paper Information - High-Performance Distributed RMA Locks

High-Performance Distributed RMA Locks

We propose a topology-aware distributed Reader-Writer lock that accelerates irregular workloads for supercomputers and data centers. The core idea behind the lock is a modular design that combines three distributed data structures: a counter of readers/writers in the critical section, a set of queues for ordering writers waiting for the lock, and a tree that binds all the queues and synchronizes writers with readers. Each structure is associated with a parameter for favoring either readers or writers, enabling adjustable performance that can be viewed as a point in a three-dimensional parameter space. We also develop a distributed topology-aware MCS lock that is a building block of the above design and improves on state-of-the-art MPI implementations. Both schemes use non-blocking Remote Memory Access (RMA) techniques for the highest performance and scalability. We evaluate our schemes on a Cray XC30 and show that they outperform state-of-the-art MPI-3 RMA locking protocols by 81% and 73%, respectively. Finally, we use them to accelerate a distributed hashtable that represents irregular workloads such as key-value stores or graph processing.
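To make the MCS building block mentioned in the abstract more concrete, here is a minimal sketch of the classic shared-memory MCS queue lock. It is not the paper's distributed, topology-aware RMA variant. Threads stand in for processes, and the hypothetical `_AtomicRef` helper stands in for the atomic swap and compare-and-swap operations that an RMA implementation would issue on remote memory. All class and function names (`_AtomicRef`, `QNode`, `MCSLock`, `run_demo`) are illustrative assumptions.

```python
import threading
import time


class _AtomicRef:
    """Hypothetical stand-in for a word updated with atomic swap / CAS."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()  # emulates hardware atomicity

    def swap(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class QNode:
    """Per-waiter queue node; each waiter blocks only on its own flag."""

    def __init__(self):
        self.next = None
        self.granted = threading.Event()


class MCSLock:
    def __init__(self):
        self._tail = _AtomicRef(None)  # points at the last waiter, or None

    def acquire(self):
        node = QNode()
        pred = self._tail.swap(node)   # enqueue ourselves atomically
        if pred is not None:
            pred.next = node           # link behind the predecessor
            node.granted.wait()        # wait on our own flag only
        return node

    def release(self, node):
        if node.next is None:
            # No visible successor: try to reset the queue to empty.
            if self._tail.compare_and_swap(node, None):
                return
            # A successor swapped the tail but has not linked itself yet.
            while node.next is None:
                time.sleep(0)
        node.next.granted.set()        # hand the lock to the successor


def run_demo(num_threads=4, increments=200):
    """Increment a shared counter under the lock and return its final value."""
    lock = MCSLock()
    counter = {"value": 0}

    def worker():
        for _ in range(increments):
            node = lock.acquire()
            current = counter["value"]      # unprotected read-modify-write,
            counter["value"] = current + 1  # safe only because of the lock
            lock.release(node)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

Calling `run_demo(4, 200)` should return 800, since every increment runs inside the critical section. The property that matters for the distributed setting is that each waiter waits on its own node rather than on a shared location, which is what makes the queue-based design attractive when remote accesses are expensive.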
