Collaborative backup for self-interested hosts

Despite the importance of backup, current approaches are inadequate. Large-scale solutions require aggregation of substantial demand to justify the costs of managing a large, centralized repository. Small-scale solutions require significant administrative effort by the end user. This dissertation presents Pastiche, a low-cost, convenient backup service. Pastiche exploits excess disk capacity to create a peer-to-peer storage collective on independent, untrusted, and self-interested hosts. Costs and inconvenience are kept low by taking advantage of the excess storage capacity provided by individual contributors and through self-organization of distributed components.

Because hosts can come and go without warning, data is replicated at multiple hosts, called backup buddies. To reduce the cost of replication, users try to identify hosts that already have most of their data so that they only have to send what is unique to them. This reduces the global storage burden and bandwidth overhead. A study of workstations in the EECS department revealed that, for common installations, nodes can expect to find a sufficient number of buddies with between 30% and 70% common data.

Because hosts are self-interested, Pastiche is vulnerable to free-loading. If not kept in check, free-loading can collapse the collective. We explore three strategies for eliminating free-loading in Pastiche: bilateral-equal exchange, storage claims, and cyclic exchange. To begin, storage must be allocated through bilateral, equal exchange between hosts. If A stores data on B, B stores an equal amount on A. A can periodically query B to verify that B is still storing its data, and vice versa. If either host fails a query, the other drops that host's data in retaliation. Unfortunately, bilateral, equal exchange overconstrains storage allocation because it requires a double coincidence of wants between hosts. To enable more flexible allocation, nodes can use storage claims, which are incompressible storage placeholders.
Claims are traded for actual data and can be used as a store of value for future exchanges through overwriting and forwarding. To provide greater data reliability, we can use cyclic exchange. Cyclic exchange provides flexible, fair, and sustainable storage allocation with low network and storage overhead. In cyclic exchange, nodes construct a distributed demand graph based on their preferred storage sites.
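
The demand-graph idea can be illustrated with a small sketch. It is not the dissertation's algorithm: it assumes, purely for illustration, that the demand graph is available in one place as a mapping from each host to its preferred storage sites, and that an exchange corresponds to a directed cycle in which every host stores data on its successor. The names `find_cycle` and `storage_assignments` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def find_cycle(demand: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Depth-first search for a directed cycle through `start`.

    `demand[h]` lists the hosts on which `h` would like to store data
    (an assumed, centralized view of the demand graph). Returns the hosts
    on the cycle in order, beginning with `start`, or None if there is none.
    """
    path = [start]
    visited = {start}

    def dfs(node: str) -> bool:
        for site in demand.get(node, []):
            if site == start:
                if node != start:      # ignore a host "preferring" itself
                    return True
                continue
            if site not in visited:
                visited.add(site)
                path.append(site)
                if dfs(site):
                    return True
                path.pop()             # dead end: backtrack
        return False

    return path if dfs(start) else None


def storage_assignments(cycle: List[str]) -> List[Tuple[str, str]]:
    """Pair each host on the cycle with the successor that stores its data."""
    return list(zip(cycle, cycle[1:] + cycle[:1]))


if __name__ == "__main__":
    # Hypothetical demand graph: A prefers B, B prefers C, C prefers A.
    demand = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
    cycle = find_cycle(demand, "A")
    print(cycle)                       # ['A', 'B', 'C']
    print(storage_assignments(cycle))  # [('A', 'B'), ('B', 'C'), ('C', 'A')]
    print(find_cycle(demand, "D"))     # None: no host wants to store on D
```

Under these assumptions, a cycle of length two is the bilateral case, while longer cycles let hosts exchange storage without a double coincidence of wants. A real deployment would have to discover such cycles in a distributed fashion, which this sketch does not attempt.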