Clique is an HP Labs Grenoble project. The goal is to develop a novel peer-to-peer, server-less distributed file system based on optimistic replication algorithms, which transparently integrates into users’ native file systems. Properties of the Clique system include epidemic replication, a ‘no lost updates’ consistency model with conflict management, disconnected operation, and replica convergence. These properties ensure that updates made by any peer in the group are never lost, and that they converge on all the group members' machines. The system is well adapted to highly disconnected environments, network partitions, and variable join/leave rates. Even under adverse connectivity conditions, assuming intermittent point-to-point connectivity between each peer and at least one other peer in the group, the local file system view at each node converges over time towards a consistent global view. The reconciliation protocol is stateless and has no notion of group membership, in order to achieve a worst-case scalability that is linear in N, the number of peers in the network. A lower-layer protocol has been developed which enables one-to-all communication by taking advantage of IP Multicast, augmented with network load management and a priority mechanism that ensures liveness of the higher layers of the protocol.

1. Problem statement

The development of Clique was motivated by our perceived need for a new file sharing technology which supports collaboration and synchronization in a simple, transparent, reliable manner.

Scenario 1

In the first case, individuals are beginning to use a range of increasingly powerful computing devices with highly varying connectivity patterns.
These range from desktop PCs with a permanent, high-bandwidth connection to the Internet via the corporate LAN and firewall, through laptops with 802.11 cards that offer good connectivity at medium speeds, but only within certain limited areas, to powerful PDA devices with low bandwidth and sporadic connectivity. A user, then, will soon have a sort of ‘personal network’ of intermittently connected clients, which will increasingly be differentiated less by storage and processing capacity and more by physical mobility, network connectivity and bandwidth. The user would increasingly like to be able to access a given portfolio of files on any one of this wide range of devices. He will expect modifications made to any of these files on one device to be automatically reflected on another, even when devices are geographically far apart and connected to physically separate networks. He will want all files to be immediately accessible from all devices, even when operating in disconnected mode. Ideally, the system should scale to an arbitrary number of nodes, shared files and versions.

Scenario 2

In the second case, users will want to share their portfolio of files with groups of other people on other devices. Each user would like to have read access to the group files, but also to be able to modify them, to add new files and delete others, and to have these changes shared with the other users of the group. Typically, Adam, an HP employee using Clique, might wish to share files between 6 machines:

- A desktop PC connected via the HP intranet to the desktop PCs of his team colleagues, Bill and Christiana, and separated from the Internet by a corporate firewall.
- A home 802.11 network of two desktop PCs which are connected to an ISP via an ADSL link and have dynamically assigned IP addresses.
- A laptop on which he works while commuting between work and home.
  This computer is intermittently connected to both the HP internal network and the home office network.

During work hours, Adam’s laptop is connected to the HP intranet. Modifications made to any of the files or sub-directories in the Clique shared directory on one device, for example Adam’s desktop PC, are automatically propagated in the background to the other connected members of the Clique group, in this case his laptop and his colleagues’ desktop PCs. At the end of the working day, Adam may disconnect his laptop and continue to work on some of the shared files as he commutes. When he connects his laptop to the home network, the modifications he and his colleagues made during the working day are automatically reflected on the home network PCs, and any modifications made, for example, by his wife Diana to the files on the local network are uploaded to the laptop (and hence may be uploaded to the HP desktop PCs, along with Adam’s own modifications, when Adam reconnects to the HP intranet the following morning). As we will show, this epidemic-style replication pattern guarantees that all nodes in the Clique group will eventually achieve a consistent view of the file system state.

For maximum performance in the mobile environment, Clique employs a weakly consistent update policy, which does not place any limits on file modification permissions even when network access is unavailable. This introduces the likelihood of update conflicts, whereby multiple users independently modify a particular file. In this case, multiple distinct, but equally valid, versions of the file temporarily exist in the system. Current groupware solutions often blindly overwrite some of these versions with others according to a simplistic metric such as, for example, a comparison of version timestamps. This technique, known as the Thomas write rule [1], can have disastrous consequences from an end-user perspective.
To avoid this, Clique uses a ‘no lost updates’ reconciliation policy, which guarantees that every update made at any node in the group will always eventually be ‘seen’ by all other nodes in the group, and that no file modifications are ever irretrievably lost by the system.

Motivations

Our design motivations can be split into three categories: ease of use, support for real world conditions, and some additional desirable system properties.

Ease of use

- Transparency: Integration into the user file system. All nodes should ideally have a local copy of every file in the system. From the end-user perspective, the shared files should not appear to be any different from the standard files available on the user's hard disk. File access latency should not be noticeably different from that of ordinary files.
- Self-organization and full decentralization: No server is required, and there is no primary (master) repository for particular files. All nodes ‘own’ all files, and all share administrative tasks such as setup.
- Stability: At any moment, all files in the local file system should be in a valid, usable state.
- Mutability: The files are fully mutable on all nodes, i.e. writeable everywhere.
- Platform independence: Files may be shared between disparate platforms.

Support for real world conditions

- Tolerance of network partitions: Long-term network partitions are a feature of today’s Internet, where NAT boxes, firewalls and wireless radio shadows are commonplace, and short-term partitions can occur when using unreliable IP multicast. Certain highly mobile nodes (e.g. Adam’s laptop in the scenario above) intermittently move between network ‘islands’ and act as a ‘bridge’ between network partitions.
- Resilience: Very high tolerance to node crashes and to dynamically changing group membership, which is an intrinsic characteristic of ad-hoc wireless networking environments.
- Disconnected operation: All files should remain accessible while disconnected from the network.
- Scalability: Ideally, the Clique system should scale to 1,000 nodes, as well as to large file sizes (1 GB) and large numbers of files in the system (up to 10,000).

Additional desirable properties

- No lost updates semantics: This prevents the system from losing a modification made on any node. In the worst case, conflicting modifications are saved in alternate locations and notifications are sent to the appropriate node for manual user correction.
- Any-to-any update transfer: Each peer must receive all the updates issued by other peers, even if the original issuers are not directly reachable. In other words, we wish to achieve epidemic replication of the volume contents [2].
- Convergence: If no updates occur in the system, and if nodes are reasonably interconnected (i.e. there are no partitions in the peer connectivity graph), then all peers will converge to the same replicated state [3].
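
The three properties above can be illustrated with a minimal sketch. It is not the Clique reconciliation protocol, whose details are not given in this section. It assumes a simple model in which each replica keeps every file version it has ever seen, each version records the versions it supersedes, and two peers reconcile by exchanging the union of what they know. The names Replica, Version, heads, write, sync and conflicts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    """An immutable file version; `parents` are the version ids it supersedes."""
    vid: str
    content: str
    parents: frozenset


@dataclass
class Replica:
    """Hypothetical peer model (an assumption, not the Clique data structures)."""
    name: str
    counter: int = 0
    # path -> {vid: Version}; versions are never deleted, so no update is lost
    store: dict = field(default_factory=dict)

    def heads(self, path):
        """Versions of `path` that no other known version supersedes."""
        versions = self.store.get(path, {})
        superseded = set()
        for v in versions.values():
            superseded |= v.parents
        return sorted(vid for vid in versions if vid not in superseded)

    def write(self, path, content):
        """Local write, always permitted, even while disconnected."""
        self.counter += 1
        vid = f"{self.name}#{self.counter}"
        parents = frozenset(self.heads(path))
        self.store.setdefault(path, {})[vid] = Version(vid, content, parents)
        return vid

    def sync(self, other):
        """Pairwise exchange: both peers end with the union of known versions."""
        for path in set(self.store) | set(other.store):
            merged = {**self.store.get(path, {}), **other.store.get(path, {})}
            self.store[path] = dict(merged)
            other.store[path] = dict(merged)

    def conflicts(self, path):
        """More than one head means concurrent, unreconciled modifications."""
        return len(self.heads(path)) > 1


# Scenario 2 in miniature: the laptop bridges two partitions that never
# communicate directly.
desktop, laptop, home = Replica("desktop"), Replica("laptop"), Replica("home")
desktop.write("report.txt", "draft")
desktop.sync(laptop)
laptop.sync(home)
desktop.write("report.txt", "Adam's edit")   # made at work
home.write("report.txt", "Diana's edit")     # made at home, concurrently
laptop.sync(desktop)
laptop.sync(home)
laptop.sync(desktop)
print(desktop.heads("report.txt"))  # ['desktop#2', 'home#1']
```

In this sketch the two concurrent edits both survive as heads on every replica, which is the situation a timestamp-based overwrite would destroy. A later write on any peer supersedes both heads while the earlier versions stay in the store, standing in for the manual user correction mentioned above.
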
[1] P. Cederqvist, et al. Version Management with CVS. 1993.
[2] Peter Druschel, et al. Storage management and caching in PAST. 2001.
[3] Divyakant Agrawal, et al. Epidemic algorithms in replicated databases (extended abstract). PODS, 1997.
[4] William I. Nowicki, et al. NFS: Network File System Protocol specification. RFC, 1989.
[5] Marvin Theimer, et al. Designing and implementing asynchronous collaborative applications with Bayou. UIST '97, 1997.
[6] Antony I. T. Rowstron, et al. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. SOSP, 2001.
[7] Mark Handley, et al. A scalable content-addressable network. SIGCOMM '01, 2001.
[8] Vic Stenning, et al. A Data Transfer Protocol. Comput. Networks, 1976.
[9] Peter Druschel, et al. Pastry: Scalable, distributed object location and routing for large-scale peer-to-. 2001.
[10] Alley Stoughton, et al. Detection of Mutual Inconsistency in Distributed Systems. IEEE Transactions on Software Engineering, 1983.
[11] Ashish Goel, et al. Perspectives on optimistically replicated, peer‐to‐peer filing. Softw. Pract. Exp., 1998.
[12] Keith Marzullo, et al. Directional Gossip: Gossip in a Wide Area Network. EDCC, 1999.
[13] Richard A. Golding, et al. Weak-consistency group communication and membership. 1992.
[14] David R. Karger, et al. Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM '01, 2001.
[15] Doug Terry, et al. Epidemic algorithms for replicated database maintenance. OPSR, 1988.
[16] Dorota M. Huizinga, et al. Experience with Connected and Disconnected Operation of Portable Notebook Computers in Distributed Systems. 1994 First Workshop on Mobile Computing Systems and Applications, 1994.
[17] Ben Y. Zhao, et al. OceanStore: an architecture for global-scale persistent storage. SIGP, 2000.
[18] F. Alajaji, et al. c ○ Copyright by. 1998.
[19] Dennis Shasha, et al. The dangers of replication and a solution. SIGMOD '96, 1996.
[20] Ben Y. Zhao, et al. An Infrastructure for Fault-tolerant Wide-area Location and Routing. 2001.
[21] Marc Shapiro, et al. Replication: Optimistic Approaches. 2002.
[22] Antony I. T. Rowstron, et al. Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Middleware, 2001.
[23] Robert H. Thomas, et al. A Majority consensus approach to concurrency control for multiple copy databases. ACM Trans. Database Syst., 1979.
[24] 李幼升, et al. Ph. 1989.
[25] Jörg Liebeherr, et al. Application-layer multicasting with Delaunay triangulation overlays. IEEE J. Sel. Areas Commun., 2002.
[26] Irene Greif, et al. Replicated document management in a group communication system. CSCW '88, 1988.
[27] Antony I. T. Rowstron, et al. The IceCube approach to the reconciliation of divergent replicas. PODC '01, 2001.
[28] Nicolas Vidot, et al. Copies convergence in a distributed real-time collaborative environment. CSCW '00, 2000.
[29] Mahadev Satyanarayanan, et al. Disconnected Operation in the Coda File System. Mobidata, 1999.
[30] Mahadev Satyanarayanan, et al. Coda: a highly available file system for a distributed workstation environment. Proceedings of the Second Workshop on Workstation Operating Systems, 1989.
[31] John S. Heidemann, et al. Resolving File Conflicts in the Ficus File System. USENIX Summer, 1994.
[32] Peter L. Reiher, et al. A simulation evaluation of optimistic replicated filing in mobile environments. 1999 IEEE International Performance, Computing and Communications Conference (Cat. No.99CH36305), 1999.
[33] Mahadev Satyanarayanan, et al. Coda: A Highly Available File System for a Distributed Workstation Environment. IEEE Trans. Computers, 1990.