Self-stabilization Overhead: an Experimental Case Study on Coded Atomic Storage

Shared memory emulation can be used as a fault-tolerant and highly available distributed storage solution or as a low-level synchronization primitive. Attiya, Bar-Noy, and Dolev were the first to propose a single-writer, multi-reader linearizable register emulation in which the register is replicated to all servers. Recently, Cadambe et al. proposed the Coded Atomic Storage (CAS) algorithm, which uses erasure coding to achieve data redundancy at a much lower communication cost than previous algorithmic solutions. Although CAS can tolerate server crashes, it was not designed to recover from unexpected, transient faults without external (human) intervention. To address this, Dolev, Petig, and Schiller have recently developed a self-stabilizing version of CAS, which we call CASSS. As one would expect, self-stabilization is not free: it mainly introduces communication overhead for detecting inconsistencies and stale information. One may therefore wonder whether the overhead introduced by self-stabilization nullifies the gain of erasure coding. To answer this question, we have implemented and experimentally evaluated the CASSS algorithm on PlanetLab, a planetary-scale distributed infrastructure. The evaluation shows that our implementation of CASSS scales very well with the number of servers, the number of concurrent clients, and the size of the replicated object. More importantly, it shows that (a) CASSS has only a constant overhead compared to the traditional CAS algorithm (which we also implement) and (b) the recovery period (after the last occurrence of a transient fault) lasts only as long as a few client (read/write) operations. Our results suggest that CASSS provides automatic recovery from transient faults and bounded resource usage without significantly impacting efficiency.
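The communication saving attributed to erasure coding can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes an [N, k] MDS code (for example Reed-Solomon), in which each of the N servers receives one coded element of size 1/k of the object, and it ignores metadata such as tags as well as any self-stabilization messages. The function names and the numbers in the usage note are hypothetical.

```python
from fractions import Fraction


def replication_cost(n_servers: int, object_size: int) -> Fraction:
    """Total payload sent when a full copy of the object goes to every server."""
    return Fraction(n_servers * object_size)


def coded_cost(n_servers: int, k: int, object_size: int) -> Fraction:
    """Total payload sent when each server gets one coded element.

    Assumes an [n_servers, k] MDS code, so each coded element has
    size object_size / k (an assumption of this sketch).
    """
    if not 1 <= k <= n_servers:
        raise ValueError("k must satisfy 1 <= k <= n_servers")
    return Fraction(n_servers * object_size, k)


if __name__ == "__main__":
    n, k, size = 10, 5, 1_000_000  # illustrative values only
    print("replication:", replication_cost(n, size))
    print("erasure coded:", coded_cost(n, k, size))
```

With these illustrative values, full replication sends 10,000,000 units per write while the coded scheme sends 2,000,000, a factor of k less. The question studied here is whether the extra messages needed for self-stabilization erode that factor.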
