Optimizing Hash-based Distributed Storage Using Client Choices

Many distributed storage systems place data blocks using hash-based methods. While hashing improves scalability, it lacks the flexibility that modern applications need for performance optimization. We propose CHOICE, a design that offers clients multiple candidate placements for each block, together with relevant server performance metrics, so that clients can implement their own placement policies, such as favoring better locality or less busy servers. CHOICE requires minimal changes to the storage server and is thus easy to deploy. We have implemented it in Ceph, a popular open-source distributed storage system. On two real Ceph clusters with 45 and 176 disks respectively, we show that the right placement policy can substantially improve performance.
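
To make the idea concrete, the sketch below illustrates one client-side choice policy of the kind CHOICE enables: the client is handed several hash-derived candidate placements along with per-server metrics and picks the least busy (or most local) candidate, in the spirit of power-of-two-choices load balancing. All names, fields, and the load metric here are hypothetical illustrations under assumed semantics, not the actual CHOICE or Ceph API.

```python
# A minimal sketch of a client-side choice policy, assuming the storage
# system exposes k candidate placements (e.g., from k hash functions)
# plus per-server metrics. Field names and the queue-depth metric are
# illustrative assumptions, not the real CHOICE/Ceph interface.

from dataclasses import dataclass

@dataclass
class Candidate:
    server_id: str
    queue_depth: int   # assumed load metric reported by the server
    is_local: bool     # assumed locality hint (e.g., same rack or host)

def choose_placement(candidates: list[Candidate]) -> Candidate:
    """Prefer local candidates; among those, pick the least busy one."""
    local = [c for c in candidates if c.is_local]
    pool = local if local else candidates
    return min(pool, key=lambda c: c.queue_depth)

# Example: two hash-derived candidates, as in power-of-two-choices.
candidates = [
    Candidate("osd.12", queue_depth=37, is_local=False),
    Candidate("osd.45", queue_depth=4, is_local=False),
]
print(choose_placement(candidates).server_id)  # -> osd.45
```

Because the policy runs entirely on the client, swapping in a different heuristic (pure least-loaded, locality-first, or a weighted combination) requires no server-side changes, which is consistent with the deployability claim above.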
