Scalable Data-structures with Hierarchical, Distributed Delegation

Scaling data-structures to the growing number of cores in modern systems is challenging. The quest for scalability is complicated by the non-uniform memory access (NUMA) characteristics of multi-socket machines, which often prohibit the effective use of data-structures that span memory localities. Conventional shared-memory data-structures, whether non-blocking or lock-based, inevitably suffer from cache-coherency overheads and non-local memory accesses between sockets. Multi-socket systems are common in cloud hardware, and many products are pushing shared-memory systems to greater scales, making the ability to scale data-structures all the more pressing. In this paper, we present the Distributed, Delegated Parallel Sections (DPS) runtime system, which uses message-passing to move computation on portions of data-structures between memory localities, while leveraging efficient shared-memory implementations within each locality to harness parallelism. We show through a series of data-structure scalability evaluations, and through an adaptation of memcached, that DPS enables strong data-structure scalability. For memcached, DPS improves throughput by more than a factor of 3.1 and decreases tail latency by 23x.
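To make the delegation idea concrete, the following is a minimal sketch of partitioning a map across localities and shipping operations to the owning locality by message. It is not the DPS API: the names `Locality`, `DelegatedMap`, and `execute` are hypothetical, Python threads stand in for cores pinned to sockets, and a lock-protected dictionary stands in for an efficient shared-memory data-structure within each locality.

```python
import queue
import threading
from concurrent.futures import Future


class Locality:
    """Hypothetical memory locality: owns one shard and serves delegated operations."""

    def __init__(self):
        self.shard = {}
        # Stand-in for an efficient shared-memory structure used within a locality.
        self.lock = threading.Lock()
        self.inbox = queue.Queue()
        self.server = threading.Thread(target=self._serve, daemon=True)
        self.server.start()

    def _serve(self):
        # Run delegated operations against the local shard until told to stop.
        while True:
            item = self.inbox.get()
            if item is None:
                break
            op, args, fut = item
            try:
                fut.set_result(op(self, *args))
            except Exception as exc:
                fut.set_exception(exc)

    def put(self, key, value):
        with self.lock:
            self.shard[key] = value

    def get(self, key):
        with self.lock:
            return self.shard.get(key)


class DelegatedMap:
    """Hypothetical map partitioned across localities by key hash."""

    def __init__(self, n_localities):
        self.localities = [Locality() for _ in range(n_localities)]

    def _owner(self, key):
        return hash(key) % len(self.localities)

    def execute(self, caller_locality, op, key, *args):
        owner = self._owner(key)
        target = self.localities[owner]
        if owner == caller_locality:
            # Local data: operate directly through shared memory.
            return op(target, key, *args)
        # Remote data: send the operation as a message and wait for the reply.
        fut = Future()
        target.inbox.put((op, (key, *args), fut))
        return fut.result()

    def close(self):
        for loc in self.localities:
            loc.inbox.put(None)
            loc.server.join()
```

A caller in locality 0 would write with `m.execute(0, Locality.put, key, value)`. The operation runs in place when the key hashes to locality 0 and is otherwise delegated to the owning locality's server thread, so the data itself never crosses the locality boundary.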
