Deister: A Light-Weight Autonomous Block Management in Data-Intensive File Systems Using Deterministic Declustering Distribution

Over the last few decades, data-intensive file systems (DiFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), have become the key storage architectures for big data processing. These systems divide files into fixed-size blocks (or chunks); each block is replicated (usually three-way) and distributed pseudo-randomly across the cluster. The master node (namenode) maintains a huge table recording the location of every block and its replicas. However, as data sizes grow, the block location table and its maintenance can occupy more than half of the memory and 30% of the processing capacity of the master node, severely limiting the master node's scalability and performance. We argue that physical data distribution and maintenance should be separated from metadata management and performed autonomously by each storage node. In this paper, we propose Deister, a novel block management scheme built on an invertible deterministic declustering distribution method called Intersected Shifted Declustering (ISD). Deister is amenable to current research on scaling namespace management in the master node. In Deister, the huge block location table in the master node is eliminated, and maintenance of the block-node mapping is performed autonomously on each data node. Results show that, compared with the default HDFS configuration, Deister achieves identical performance while saving about half of the master node's RAM, and is expected to scale to double the size of a current single-namenode HDFS cluster, pushing the scalability bottleneck of the master node back to namespace management.
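The core idea of an invertible deterministic placement can be illustrated with a small sketch. This is not the actual ISD algorithm from the paper, only a minimal, hypothetical example of the general principle: replica locations are computed from the block ID by a deterministic function, so the master node needs no block location table, and a data node can recompute (rather than look up) which blocks it should hold. The stride formula and parameters below are illustrative assumptions, not the paper's layout.

```python
# Minimal sketch of deterministic, invertible block placement.
# NOT the ISD layout from the paper -- an illustrative stand-in that shows
# how a block-to-nodes mapping can be computed instead of stored.

def placement(block_id: int, n_nodes: int, replicas: int = 3) -> list[int]:
    """Deterministically map a block ID to `replicas` distinct node IDs."""
    # Choose a stride so that replicas * stride < n_nodes, which guarantees
    # the replica node IDs are pairwise distinct (requires n_nodes > replicas).
    stride = 1 + block_id % ((n_nodes - 1) // replicas)
    primary = block_id % n_nodes
    return [(primary + j * stride) % n_nodes for j in range(replicas)]

def blocks_for_node(node_id: int, block_ids, n_nodes: int) -> list[int]:
    """Reverse lookup: recompute, rather than store, a node's block set."""
    return [b for b in block_ids if node_id in placement(b, n_nodes)]
```

Because `placement` is a pure function of the block ID, every data node can evaluate it independently, which is what allows the block-node mapping to be maintained autonomously without consulting a central table.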
