Scale and Concurrency of GIGA+: File System Directories with Millions of Files

We examine the problem of scalable file system directories, motivated by data-intensive applications that need to ingest millions to billions of small files into a single directory at rates of hundreds of thousands of file creates per second. We introduce a POSIX-compliant scalable directory design, GIGA+, that distributes directory entries over a cluster of server nodes. For scalability, each server makes only local, independent decisions about migration for load balancing. GIGA+ uses two internal implementation tenets, asynchrony and eventual consistency, to: (1) partition an index among all servers without synchronization or serialization, and (2) gracefully tolerate stale index state at the clients. Applications, however, are still provided traditional strong, synchronous consistency semantics. We have built GIGA+ and demonstrated that it scales better than existing distributed directory implementations, delivers a sustained throughput of more than 98,000 file creates per second on a 32-server cluster, and balances load more efficiently than consistent hashing.
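The two tenets above can be illustrated with a toy, single-process sketch. It is not the GIGA+ implementation: the abstract does not specify the index structure, so the sketch assumes an extendible-hashing-style binary split, and every name in it (`Directory`, `Client`, `route`, `SPLIT_THRESHOLD`, the round-robin `server_of` placement) is hypothetical. It shows a partition splitting on a purely local decision when it overflows, and a client with a stale view of the partitions being corrected lazily by whichever partition it contacts.

```python
import hashlib

SPLIT_THRESHOLD = 4  # hypothetical partition capacity, kept small for the demo


def name_hash(name: str) -> int:
    """Hash a file name to a 64-bit integer (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], "big")


def route(h: int, known: set) -> int:
    """Pick the deepest partition in `known` that could hold hash `h`."""
    depth = max(i.bit_length() for i in known)
    index = h & ((1 << depth) - 1)
    while index not in known:
        index &= ~(1 << (index.bit_length() - 1))  # fall back to the parent
    return index


class Directory:
    """All partitions of one directory; partition i is placed on server i % N."""

    def __init__(self, num_servers: int):
        self.num_servers = num_servers
        self.partitions = {0: {"depth": 0, "entries": {}}}

    def server_of(self, index: int) -> int:
        return index % self.num_servers

    def _stale_hint(self, index: int, h: int):
        """Return the children split off from `index` if `h` no longer belongs here."""
        depth = self.partitions[index]["depth"]
        if h & ((1 << depth) - 1) == index:
            return None
        return {index | (1 << k) for k in range(index.bit_length(), depth)}

    def _split(self, index: int) -> None:
        """Local decision: move entries whose next hash bit is 1 to a new partition."""
        part = self.partitions[index]
        depth = part["depth"]
        moved = {n: h for n, h in part["entries"].items() if (h >> depth) & 1}
        for n in moved:
            del part["entries"][n]
        part["depth"] = depth + 1
        self.partitions[index | (1 << depth)] = {"depth": depth + 1, "entries": moved}

    def create(self, index: int, name: str, h: int):
        hint = self._stale_hint(index, h)
        if hint is None:
            part = self.partitions[index]
            part["entries"][name] = h
            if len(part["entries"]) > SPLIT_THRESHOLD:
                self._split(index)
        return hint

    def lookup(self, index: int, name: str, h: int):
        hint = self._stale_hint(index, h)
        found = hint is None and name in self.partitions[index]["entries"]
        return found, hint


class Client:
    """Caches a possibly stale set of partition indices; starts knowing only 0."""

    def __init__(self, directory: Directory):
        self.dir = directory
        self.known = {0}
        self.redirects = 0

    def create(self, name: str) -> None:
        h = name_hash(name)
        while (hint := self.dir.create(route(h, self.known), name, h)) is not None:
            self.known |= hint  # learn about splits, then retry
            self.redirects += 1

    def lookup(self, name: str) -> bool:
        h = name_hash(name)
        while True:
            found, hint = self.dir.lookup(route(h, self.known), name, h)
            if hint is None:
                return found
            self.known |= hint
            self.redirects += 1


if __name__ == "__main__":
    d = Directory(num_servers=4)
    writer = Client(d)
    for i in range(100):
        writer.create(f"file{i}")
    reader = Client(d)  # fresh client: its view of the partitions is stale
    print(reader.lookup("file42"), "redirects:", reader.redirects)
```

In this sketch no split ever contacts another partition or any client, and a stale client is never wrong, only slower: each redirect teaches it at least one partition it did not know, so retries terminate. A real system would also need to handle concurrency, persistence, and failures, none of which is modeled here.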
