MAPX: Controlled Data Migration in the Expansion of Decentralized Object-Based Storage Systems

Data placement is critical for the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While CRUSH-based storage systems enjoy the benefits of decentralization, such as high scalability, robustness, and performance, they suffer from uncontrolled data migration when a cluster is expanded, which causes significant performance degradation when the expansion is nontrivial. This paper presents MAPX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) to control data migration during cluster expansions. Each expansion is viewed as a new layer of the CRUSH map, represented by a virtual node beneath the CRUSH root. MAPX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MAPX is applicable to a wide variety of object-based storage scenarios in which object timestamps can be maintained as higher-level metadata. For example, we apply MAPX to Ceph-RBD by extending the RBD metadata structure to maintain and retrieve approximate object creation times at the granularity of expansion layers. Experimental results show that the MAPX-based migration-free system outperforms the CRUSH-based system (which is busy migrating objects after expansions) by up to 4.25× in tail latency.
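The time-dimension mapping described above can be illustrated with a toy model. The following Python sketch is not the MAPX implementation or the CRUSH algorithm: the class `LayeredMap`, its methods, the device names, and the rule "a PG belongs to the newest layer whose expansion time is not later than the PG timestamp" are all assumptions made for illustration, and rendezvous hashing stands in for CRUSH's device selection inside a layer.

```python
import bisect
import hashlib


def stable_hash(*parts):
    """Deterministic 64-bit hash of the parts (a stand-in for CRUSH's hash)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class LayeredMap:
    """Hypothetical cluster map with one layer of devices per expansion."""

    def __init__(self):
        self.layer_times = []    # expansion timestamps, strictly ascending
        self.layer_devices = []  # devices added by each expansion

    def expand(self, timestamp, devices):
        """Record an expansion as a new layer; older layers are left untouched."""
        if self.layer_times and timestamp <= self.layer_times[-1]:
            raise ValueError("expansion timestamps must increase")
        self.layer_times.append(timestamp)
        self.layer_devices.append(list(devices))

    def layer_for(self, pg_timestamp):
        """Assumed rule: newest layer whose expansion time <= PG timestamp."""
        idx = bisect.bisect_right(self.layer_times, pg_timestamp) - 1
        if idx < 0:
            raise ValueError("PG timestamp predates the first layer")
        return idx

    def place(self, pg_id, pg_timestamp, replicas=2):
        """Pick replica devices for a PG inside the layer its timestamp selects."""
        devices = self.layer_devices[self.layer_for(pg_timestamp)]
        # Rendezvous (highest-random-weight) hashing, not real CRUSH.
        ranked = sorted(devices, key=lambda d: stable_hash(pg_id, d), reverse=True)
        return ranked[:replicas]


if __name__ == "__main__":
    cmap = LayeredMap()
    cmap.expand(100, ["osd.0", "osd.1", "osd.2"])
    before = cmap.place(pg_id=7, pg_timestamp=150)

    cmap.expand(200, ["osd.3", "osd.4", "osd.5"])
    after = cmap.place(pg_id=7, pg_timestamp=150)

    print("old PG unchanged:", before == after)
    print("new PG placement:", cmap.place(pg_id=9, pg_timestamp=250))
```

In this model an expansion only appends a layer, so a PG whose timestamp falls before the expansion keeps its placement and no data moves. Changing a PG's timestamp would remap it to a different layer, which is the kind of control the abstract attributes to timestamp manipulation.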
