Dynamic Atomic Snapshots

Snapshots are useful tools for monitoring large distributed and parallel systems. In this paper, we adapt the well-known atomic snapshot abstraction to dynamic models with an unbounded number of participating processes. Our dynamic snapshot specification extends the API to allow changing the set of processes whose values are returned by a scan operation. We introduce the ephemeral memory model, which consists of a dynamically changing set of nodes; when a node is removed, its memory can be reclaimed immediately. In this model, we present an algorithm for wait-free dynamic atomic snapshots.
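To make the shape of such an API concrete, the following is a minimal sketch of a sequential reference model for a dynamic snapshot object. It is not the paper's algorithm: it serializes every operation with a single lock, so it is atomic but not wait-free, and it only illustrates how the set of processes covered by a scan can change over time. The class name `DynamicSnapshot` and the operation names `add`, `remove`, `update`, and `scan` are assumptions chosen for illustration; the abstract does not name the operations.

```python
import threading
from typing import Dict, Hashable, Optional


class DynamicSnapshot:
    """Lock-based reference model of a dynamic snapshot object.

    Hypothetical illustration only: a single lock makes every operation
    atomic, but the object is blocking, not wait-free.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Maps each currently tracked process id to its latest value.
        self._values: Dict[Hashable, Optional[object]] = {}

    def add(self, pid: Hashable) -> None:
        """Start including pid in later scans (initial value None)."""
        with self._lock:
            self._values.setdefault(pid, None)

    def remove(self, pid: Hashable) -> None:
        """Stop including pid in later scans and drop its stored value."""
        with self._lock:
            self._values.pop(pid, None)

    def update(self, pid: Hashable, value: object) -> None:
        """Record a new value for a process that is currently tracked."""
        with self._lock:
            if pid not in self._values:
                raise KeyError(f"process {pid!r} is not in the snapshot set")
            self._values[pid] = value

    def scan(self) -> Dict[Hashable, Optional[object]]:
        """Return a consistent copy of the values of all tracked processes."""
        with self._lock:
            return dict(self._values)
```

In this sketch, removing a process discards its entry right away, which loosely mirrors the idea that a removed node's memory can be reclaimed; a wait-free implementation has to reach the same observable behavior without relying on a lock.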
