Reclaiming space from duplicate files in a serverless distributed file system

The Farsite distributed file system provides availability by replicating each file onto multiple desktop computers. Since this replication consumes significant storage space, it is important to reclaim used space where possible. Measurement of over 500 desktop file systems shows that nearly half of all consumed space is occupied by duplicate files. We present a mechanism to reclaim space from this incidental duplication to make it available for controlled file replication. Our mechanism includes: (1) convergent encryption, which enables duplicate files to be coalesced into the space of a single file, even if the files are encrypted with different users' keys; and (2) SALAD, a Self-Arranging Lossy Associative Database for aggregating file content and location information in a decentralized, scalable, fault-tolerant manner. Large-scale simulation experiments show that the duplicate-file coalescing system is scalable, highly effective, and fault-tolerant.

[1]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[2]  O. Sheng Dynamic File Migration in Distributed Computer Systems , 1990, ICIS.

[3]  Mahadev Satyanarayanan,et al.  Scale and performance in a distributed file system , 1987, SOSP '87.

[4]  Oivind Kure Optimization of file migration in distributed systems , 1988 .

[5]  Joel L. Wolf,et al.  The placement optimization program: a practical solution to the disk file assignment problem , 1989, SIGMETRICS '89.

[6]  Kenneth P. Birman,et al.  Deceit: a flexible distributed file system , 1990, [1990] Proceedings. Workshop on the Management of Replicated Data.

[7]  Ray Jain,et al.  The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.

[8]  John S. Heidemann,et al.  Implementation of the Ficus Replicated File System , 1990, USENIX Summer.

[9]  Helen Custer,et al.  Inside Windows NT , 1992 .

[10]  Daniel L. McCue,et al.  Computing replica placement in distributed systems , 1992, [1992 Proceedings] Second Workshop on the Management of Replicated Data.

[11]  Ralph E. Droms,et al.  DHCP Options and BOOTP Vendor Extensions , 1993, RFC.

[12]  Ralph E. Droms,et al.  Dynamic Host Configuration Protocol , 1993, RFC.

[13]  Mihir Bellare,et al.  Random oracles are practical: a paradigm for designing efficient protocols , 1993, CCS '93.

[14]  Paul C. Kocher,et al.  Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems , 1996, CRYPTO.

[15]  Franco Zambonelli,et al.  Experience of adaptive replication in distributed file systems , 1996, Proceedings of EUROMICRO 96. 22nd Euromicro Conference. Beyond 2000: Hardware and Software Design Strategies.

[16]  Eli Biham,et al.  Two Practical and Provably Secure Block Ciphers: BEARS and LION , 1996, FSE.

[17]  Alfred Menezes,et al.  Handbook of Applied Cryptography , 2018 .

[18]  Chandramohan A. Thekkath,et al.  Frangipani: a scalable distributed file system , 1997, SOSP.

[19]  Eli Biham,et al.  Differential Fault Analysis of Secret Key Cryptosystems , 1997, CRYPTO.

[20]  Rajmohan Rajaraman,et al.  Accessing Nearby Copies of Replicated Objects in a Distributed Environment , 1997, SPAA '97.

[21]  Andrew V. Goldberg,et al.  Towards an archival Intermemory , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[22]  Miguel Oom Temudo de Castro,et al.  Practical Byzantine fault tolerance , 1999, OSDI '99.

[23]  William J. Bolosky,et al.  A large-scale study of file-system contents , 1999, SIGMETRICS '99.

[24]  William J. Bolosky,et al.  Single instance storage in Windows® 2000 , 2000 .

[25]  Aviel D. Rubin,et al.  Publius: a robust, tamper-evident, censorship-resistant web publishing system , 2000 .

[26]  Bruce Schneier,et al.  Side Channel Cryptanalysis of Product Ciphers , 1998, J. Comput. Secur..

[27]  Christian Scheideler,et al.  Efficient, distributed data placement strategies for storage area networks (extended abstract) , 2000, SPAA '00.

[28]  Dawn Xiaodong Song,et al.  Practical techniques for searches on encrypted data , 2000, Proceeding 2000 IEEE Symposium on Security and Privacy. S&P 2000.

[29]  Ian Clarke,et al.  Freenet: A Distributed Anonymous Information Storage and Retrieval System , 2000, Workshop on Design Issues in Anonymity and Unobservability.

[30]  William J. Bolosky,et al.  Single Instance Storage in Windows , 2000 .

[31]  Marvin Theimer,et al.  Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs , 2000, SIGMETRICS '00.

[32]  Peter Druschel,et al.  Pastry: Scalable, distributed object location and routing for large-scale peer-to- , 2001 .

[33]  MaziéresDavid,et al.  A low-bandwidth network file system , 2001 .

[34]  Ben Y. Zhao,et al.  An Infrastructure for Fault-tolerant Wide-area Location and Routing , 2001 .

[35]  Stefan Saroiu,et al.  A Measurement Study of Peer-to-Peer File Sharing Systems , 2001 .

[36]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM '01.

[37]  Roger Wattenhofer,et al.  Optimizing file availability in a secure serverless distributed file system , 2001, Proceedings 20th IEEE Symposium on Reliable Distributed Systems.

[38]  Antony I. T. Rowstron,et al.  Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems , 2001, Middleware.

[39]  David R. Karger,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.

[40]  Ben Y. Zhao,et al.  Tapestry: a fault-tolerant wide-area application infrastructure , 2002, CCRV.

[41]  Siva Sai Yerubandi,et al.  Differential Power Analysis , 2002 .