ShardFS vs. IndexFS: replication vs. caching strategies for distributed metadata management in cloud storage systems

The rapid growth of cloud storage systems calls for fast and scalable namespace processing. While few commercial file systems offer anything better than federating individually non-scalable namespace servers, a recent academic file system, IndexFS, demonstrates scalable namespace processing based on client caching of directory entries and permissions (directory lookup state) with no per-client state in servers. In this paper we explore explicit replication of directory lookup state in all servers as an alternative to caching this information in all clients. Both eliminate most repeated RPCs to different servers in order to resolve hierarchical permission tests. Our realization for server replicated directory lookup state, ShardFS, employs a novel file system specific hybrid optimistic and pessimistic concurrency control favoring single object transactions over distributed transactions. Our experimentation suggests that if directory lookup state mutation is a fixed fraction of operations (strong scaling for metadata), server replication does not scale as well as client caching, but if directory lookup state mutation is proportional to the number of jobs, not the number of processes per job, (weak scaling for metadata), then server replication can scale more linearly than client caching and provide lower 70 percentile response times as well.

[1]  Raghu Ramakrishnan,et al.  Database Management Systems , 1976 .

[2]  Philip A. Bernstein,et al.  Concurrency Control in Distributed Database Systems , 1986, CSUR.

[3]  Jim Gray,et al.  A critique of ANSI SQL isolation levels , 1995, SIGMOD '95.

[4]  GrayJim,et al.  A critique of ANSI SQL isolation levels , 1995 .

[5]  Chandramohan A. Thekkath,et al.  Frangipani: a scalable distributed file system , 1997, SOSP.

[6]  Jim Zelenka,et al.  A cost-effective, high-bandwidth storage architecture , 1998, ASPLOS VIII.

[7]  Robert B. Ross,et al.  PVFS: A Parallel File System for Linux Clusters , 2000, Annual Linux Showcase & Conference.

[8]  Frank B. Schmuck,et al.  GPFS: A Shared-Disk File System for Large Computing Clusters , 2002, FAST.

[9]  Miguel Castro,et al.  Farsite: federated, available, and reliable storage for an incompletely trusted environment , 2002, OPSR.

[10]  Scott A. Brandt,et al.  Efficient metadata management in large distributed storage systems , 2003, 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, 2003. (MSST 2003). Proceedings..

[11]  GhemawatSanjay,et al.  The Google file system , 2003 .

[12]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[13]  Jon Howell,et al.  Distributed directory service in the Farsite file system , 2006, OSDI '06.

[14]  Carlos Maltzahn,et al.  Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.

[15]  Michael Stonebraker,et al.  The End of an Architectural Era (It's Time for a Complete Rewrite) , 2007, VLDB.

[16]  Marcos K. Aguilera,et al.  Sinfonia: a new paradigm for building scalable distributed systems , 2007, SOSP.

[17]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[18]  Shobhit Dayal,et al.  Characterizing HEC Storage Systems at Rest , 2008 .

[19]  Marcos K. Aguilera,et al.  A practical scalable distributed B-tree , 2008, Proc. VLDB Endow..

[20]  Bin Zhou,et al.  Scalable Performance of the Panasas Parallel File System , 2008, FAST.

[21]  John Bent,et al.  PLFS: a checkpoint filesystem for parallel applications , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[22]  Patrick E. O'Neil,et al.  The log-structured merge-tree (LSM-tree) , 1996, Acta Informatica.

[23]  Wilfred Pinfold,et al.  Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis , 2009, HiPC 2009.

[24]  Konstantin V. Shvachko,et al.  HDFS Scalability: The Limits to Growth , 2010, login Usenix Mag..

[25]  David L. Mills,et al.  Internet Engineering Task Force (ietf) Network Time Protocol Version 4: Protocol and Algorithms Specification , 2010 .

[26]  Mahadev Konar,et al.  ZooKeeper: Wait-free Coordination for Internet-scale Systems , 2010, USENIX ATC.

[27]  Lin Xiao,et al.  YCSB++: benchmarking and performance debugging advanced features in scalable table stores , 2011, SoCC.

[28]  Lars George,et al.  HBase: The Definitive Guide , 2011 .

[29]  Ju Wang,et al.  Windows Azure Storage: a highly available cloud storage service with strong consistency , 2011, SOSP.

[30]  Garth A. Gibson,et al.  Scale and Concurrency of GIGA+: File System Directories with Millions of Files , 2011, FAST.

[31]  Bin Fan,et al.  Small cache, big effect: provable load balancing for randomly partitioned cluster services , 2011, SoCC.

[32]  Garth Gibson,et al.  Otus: resource attribution in data-intensive clusters , 2011, MapReduce '11.

[33]  Dutch T. Meyer,et al.  A study of practical deduplication , 2011, TOS.

[34]  Cristina L. Abad,et al.  Metadata Traces and Workload Models for Evaluating Big Storage Systems , 2012, 2012 IEEE Fifth International Conference on Utility and Cloud Computing.

[35]  Christopher Frost,et al.  Spanner: Google's Globally-Distributed Database , 2012, OSDI.

[36]  Christopher Scherb The Hadoop File System , 2012 .

[37]  Garth A. Gibson,et al.  PRObE: A Thousand-Node Experimental Cluster for Computer Systems Research , 2013, login Usenix Mag..

[38]  Brent Welch,et al.  Optimizing a hybrid SSD/HDD HPC storage system based on file size distributions , 2013, 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST).

[39]  Kai Ren,et al.  TABLEFS: Enhancing Metadata Efficiency in the Local File System , 2013, USENIX Annual Technical Conference.

[40]  Kai Ren,et al.  IndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[41]  Dutch T. Meyer,et al.  Strata: High-Performance Scalable Storage on Virtualized Non-volatile Memory , 2014, FAST 2014.

[42]  David G. Andersen,et al.  Paxos Quorum Leases: Fast Reads Without Sacrificing Writes , 2014, SoCC.

[43]  Andrea C. Arpaci-Dusseau,et al.  Analysis of HDFS under HBase: a facebook messages case study , 2014, FAST.

[44]  Dutch T. Meyer,et al.  Strata: scalable high-performance storage on virtualized non-volatile memory , 2014, FAST.

[45]  Daniel J. Abadi,et al.  CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems , 2015, FAST.