A scalable ordering primitive for multicore machines

Timestamping is an essential building block for designing concurrency control mechanisms and concurrent data structures. Existing algorithms either employ physical timestamping, assuming access to synchronized clocks, or maintain a logical clock with the help of atomic instructions. Each approach has a problem. First, hardware vendors do not guarantee that the available hardware clocks are exactly synchronized, because exact synchronization is difficult to achieve in practice. Second, atomic instructions limit scalability because of cache-line contention. This paper addresses both problems by proposing and designing a scalable ordering primitive, called Ordo, that relies on invariant hardware clocks. Ordo not only enables the correct use of these clocks by providing a notion of a global hardware clock, but also frees logical timestamp-based algorithms from the burden of a software logical clock while aiming to simplify their design. We use the Ordo primitive to redesign 1) a concurrent data structure library that we apply to the Linux kernel; 2) a synchronization mechanism for concurrent programming; 3) two database concurrency control mechanisms; and 4) a clock-based software transactional memory algorithm. Our evaluation shows that the clocks may not be synchronized on two architectures (Intel and ARM) and that Ordo generally improves the efficiency of several algorithms by 1.2x to 39.7x on various architectures.
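The abstract does not describe Ordo's interface, so the following is only a minimal sketch of one way a "global hardware clock" could be built on clocks that are not exactly synchronized: two timestamps are treated as ordered only when they differ by more than an uncertainty bound. The names `get_time`, `cmp_time`, `new_time`, and `UNCERTAINTY_NS`, the bound's value, and the use of Python's monotonic clock in place of a per-core hardware counter are all assumptions made for illustration, not the paper's implementation.

```python
import time

# Hypothetical uncertainty bound in nanoseconds. A real system would have to
# measure the maximum clock offset between cores; this value is an assumption.
UNCERTAINTY_NS = 500


def get_time() -> int:
    """Read the local clock (stand-in for an invariant hardware counter)."""
    return time.monotonic_ns()


def cmp_time(t1: int, t2: int, boundary: int = UNCERTAINTY_NS) -> int:
    """Compare two timestamps that may come from different clocks.

    Returns 1 if t1 is certainly after t2, -1 if t1 is certainly before t2,
    and 0 if the two fall within the uncertainty bound and cannot be ordered.
    """
    if t1 > t2 + boundary:
        return 1
    if t1 + boundary < t2:
        return -1
    return 0


def new_time(t: int, boundary: int = UNCERTAINTY_NS) -> int:
    """Spin until the local clock is certainly later than t, then return it."""
    while True:
        now = get_time()
        if cmp_time(now, t, boundary) == 1:
            return now
```

In this sketch a caller that needs a timestamp strictly after an earlier one calls `new_time(earlier)` instead of incrementing a shared counter, so no cache line is contended; the cost is a short wait of roughly the uncertainty bound.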
