This position paper describes the results of our ongoing investigation into the possibility and cost of building a survivable store. It considers optimal-resilience systems comprising 3t+1 base storage units, t of which may fail by becoming non-responsive or arbitrarily corrupted. Our contribution includes both algorithms and lower bounds in this model. We illuminate an inherent difficulty of achieving optimal resilience in the form of two lower bounds, one on read complexity and one on write complexity. We also provide the first optimal-resilience wait-free algorithms that match these bounds in performance. Finally, we suggest some directions for future research.

Introduction. Replicating storage is a fundamental mechanism for fault tolerance. When storage can exhibit arbitrary corruption, replication can mask corrupted elements and guarantee survivability. The formal model capturing this setting is an asynchronous system with multiple processes accessing fault-prone shared memory objects [3, 2, 14]. We assume that up to a threshold t of the memory objects may fail by being non-responsive or by returning arbitrary values (i.e., by being Byzantine); this failure model was named non-responsive arbitrary (NR-Arbitrary) faults by Jayanti et al. [14]. An unbounded number of processes may access the shared memory objects, and these processes may fail by crashing. We focus on wait-free reliable storage solutions; that is, solutions that guarantee that all operations submitted by correct processes terminate despite the failure of an unbounded number of processes.

Over the past couple of years, our research in this area has focused on studying the limits of survivable storage in terms of resilience threshold and protocol complexity. We contribute a full picture concerning the possibility and complexity of emulating survivable storage in our model out of n = 3t+1 memory elements, including algorithms and lower bounds.
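To make the resilience arithmetic of the n = 3t+1 model concrete, the following minimal sketch computes the largest tolerable t for a given n and the number of replies a client can safely await. It is an illustration only, not the emulation algorithm of the accompanying paper; the function names `max_tolerated_faults` and `reply_budget` are hypothetical, and the counting argument (a client cannot wait for more than n - t replies, and at most t of those come from faulty objects) is a standard assumption about this kind of model rather than a statement taken from the text.

```python
def max_tolerated_faults(n: int) -> int:
    """Largest t such that n >= 3t + 1 (hypothetical helper, for illustration)."""
    if n < 1:
        raise ValueError("need at least one base object")
    return (n - 1) // 3


def reply_budget(t: int) -> dict:
    """Reply counts for an optimal-resilience system of n = 3t + 1 base objects.

    Assumption: up to t objects may never respond, so a wait-free client
    cannot block on more than n - t replies; of those replies, at most t
    may come from arbitrarily corrupted objects.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    n = 3 * t + 1
    awaited = n - t            # 2t + 1 replies can always be collected
    min_correct = awaited - t  # at least t + 1 of them are from correct objects
    return {"n": n, "awaited": awaited, "min_correct": min_correct}


if __name__ == "__main__":
    for n in (4, 6, 7, 10):
        print(n, max_tolerated_faults(n))
    print(reply_budget(1))
```

With t = 1, for instance, the system has 4 base objects, a client can await 3 replies, and at least 2 of them come from correct objects, so correct replies outnumber corrupted ones by exactly one.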
Our work shows that optimal-resilience algorithms have an inherent cost: we prove in [8] a lower bound of two rounds for emulating write operations with a resilience of t ≥ n/4. This is in contrast to algorithms tolerating t < n/4 NR-Arbitrary faults, which can emulate write operations in a single round. Moreover, we show a lower bound of min(t + 1, f + 2) rounds for emulating read operations in runs with f failures, in systems where the reader does not modify the base objects. These bounds are tight: in an accompanying paper [1], we provide the first emulation of a wait-free safe regular register out of 3t + 1 registers, whose read and write complexities are optimal according to the above lower bounds. That work also presents a full Byzantine shared-memory Paxos algorithm built with the reliable register emulation.

Motivation. A “storage centric” approach for service replication, discussed by Malkhi in [20], models the system as a fault-prone shared memory. This paradigm captures a fair amount of recent work, which comes in three main flavors:

∗CS and AI Lab, MIT, grishac@theory.lcs.mit.edu
†Department of Electrical Engineering, The Technion – Israel Institute of Technology. idish@ee.technion.ac.il
‡School of Computer Science and Engineering, The Hebrew University of Jerusalem. dalia@cs.huji.ac.il

[1] Yehuda Afek et al. Benign Failure Models for Shared Memory (Preliminary Version). WDAG, 1993.
[2] Nancy A. Lynch et al. Consensus in the presence of partial synchrony. JACM, 1988.
[3] Shiding Lin et al. A Practical Distributed Mutual Exclusion Protocol in Dynamic Peer-to-Peer Systems. IPTPS, 2004.
[4] Dahlia Malkhi et al. From Byzantine agreement to practical survivability: a position paper. Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems, 2002.
[5] Michael K. Reiter et al. Backoff protocols for distributed mutual exclusion and ordering. Proceedings 21st International Conference on Distributed Computing Systems, 2001.
[6] Chandramohan A. Thekkath et al. Petal: distributed virtual disks. ASPLOS VII, 1996.
[7] Leslie Lamport et al. Disk Paxos. Distributed Computing, 2003.
[8] Fred B. Schneider et al. COCA: a secure distributed online certification authority. 2002.
[9] Miguel Oom Temudo de Castro et al. Practical Byzantine fault tolerance. OSDI '99, 1999.
[10] H. Venkateswaran et al. Responsive Security for Stored Data. IEEE Trans. Parallel Distributed Syst., 2003.
[11] Dahlia Malkhi et al. Active disk paxos with infinitely many processes. PODC, 2002.
[12] Michael Dahlin et al. Minimal Byzantine Storage. DISC, 2002.
[13] Rodrigo Rodrigues et al. Rosebud: A Scalable Byzantine-Fault-Tolerant Storage Architecture. 2003.
[14] Hagit Attiya et al. Sharing Memory with Semi-byzantine Clients and Faulty Storage Servers. Parallel Process. Lett., 2006.
[15] Sam Toueg et al. Asynchronous consensus and broadcast protocols. JACM, 1985.
[16] Dahlia Malkhi et al. Light-Weight Leases for Storage-Centric Coordination. International Journal of Parallel Programming, 2006.
[17] Vassos Hadzilacos et al. Using Failure Detectors to Solve Consensus in Asynchronous Shared-Memory Systems (Extended Abstract). WDAG, 1994.
[18] Michael K. Reiter et al. An Architecture for Survivable Coordination in Large Distributed Systems. IEEE Trans. Knowl. Data Eng., 2000.
[19] Rida A. Bazzi. Synchronous Byzantine quorum systems. PODC '97, 1997.
[20] Chandramohan A. Thekkath et al. Frangipani: a scalable distributed file system. SOSP, 1997.
[21] Idit Keidar et al. Byzantine disk paxos: optimal resilience with byzantine shared memory. PODC, 2004.
[22] Sam Toueg et al. Fault-tolerant wait-free shared objects. Proceedings, 33rd Annual Symposium on Foundations of Computer Science, 1992.
[23] David S. Greenberg et al. Computing with faulty shared objects. JACM, 1995.