Scaling HDFS with a Strongly Consistent Relational Model for Metadata

The Hadoop Distributed File System (HDFS) scales to store tens of petabytes of data despite the fact that the entire file system's metadata must fit on the heap of a single Java virtual machine. In production, the size of HDFS metadata is limited to under 100 GB, as garbage-collection pauses in larger clusters cause heartbeats to the metadata server (the NameNode) to time out. In this paper, we address the problem of migrating HDFS metadata to a relational model, so that larger amounts of storage can be supported on a shared-nothing, in-memory, distributed database. Our main contribution is to show how to provide consistency semantics at least as strong as those of HDFS while adding support for a multiple-writer, multiple-reader concurrency model. We guarantee freedom from deadlocks by logically organizing inodes and their constituent blocks and replicas into a hierarchy and having all metadata operations agree on a global order for acquiring both explicit locks and implicit locks on subtrees of the hierarchy. We use transactions with pessimistic concurrency control to ensure the safety and progress of metadata operations. Finally, we show how to improve the performance of our solution by introducing a snapshotting mechanism at the NameNodes that minimizes the number of round trips to the database.
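The deadlock-freedom argument above rests on a standard property of ordered lock acquisition: if every transaction acquires locks in the same global order, no circular wait can form. A minimal sketch in Java illustrates the idea; the class and method names (`Inode`, `acquireAll`) are hypothetical and are not the paper's actual API, and a real implementation would acquire implicit subtree locks as well as the per-inode locks shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of deadlock-free locking over an inode hierarchy.
// Every metadata operation sorts the inodes it touches by a globally
// agreed total order (here, an inode id) before acquiring any lock,
// so two concurrent operations can never wait on each other in a cycle.
public class OrderedLocking {
    static class Inode {
        final long id; // globally unique, totally ordered identifier
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        Inode(long id) { this.id = id; }
    }

    // Acquire locks on all inodes touched by an operation, always in
    // ascending id order; returns the list in acquisition order.
    static List<Inode> acquireAll(List<Inode> touched, boolean write) {
        List<Inode> sorted = new ArrayList<>(touched);
        sorted.sort(Comparator.comparingLong(n -> n.id));
        for (Inode n : sorted) {
            if (write) n.lock.writeLock().lock();
            else n.lock.readLock().lock();
        }
        return sorted;
    }

    // Release in reverse acquisition order when the transaction ends.
    static void releaseAll(List<Inode> held, boolean write) {
        for (int i = held.size() - 1; i >= 0; i--) {
            Inode n = held.get(i);
            if (write) n.lock.writeLock().unlock();
            else n.lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        Inode root = new Inode(1), dir = new Inode(2), file = new Inode(3);
        // Regardless of the order in which an operation names its inodes,
        // locks are taken root-first by id.
        List<Inode> held = acquireAll(List.of(file, root, dir), true);
        System.out.println(held.get(0).id + "," + held.get(1).id + "," + held.get(2).id);
        releaseAll(held, true);
        System.out.println("released");
    }
}
```

The same ordering discipline extends to implicit subtree locks: locking an ancestor before any descendant is consistent with an id order in which parents precede children.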
