Coupling Right-Provisioned Cold Storage Data Centers with Deduplication

Modern cloud-scale cold storage data centers have begun to support right-provisioning of a rack’s resources (power, cooling, etc.), which reduces the cost of ownership by allowing only a small fraction of the hard disks to be active (spinning) at any given time. Data deduplication, a traditional approach that splits files into chunks and eliminates duplicate chunks, can also cut costs for cold storage systems. However, when combined with right-provisioning, classical deduplication may scatter a file's chunks across disks, some of which are not currently active, leading to unacceptable access performance because disks must be spun up and down. In this paper, we analyze the deduplication ratio under real-world cloud cold storage workloads and observe that, for most workloads, the deduplication ratio 1) generally increases quickly over the first few versions of the workload, and 2) increases slowly but steadily over subsequent versions, forming a long tail. Based on the first observation, we propose an online deduplication scheme that improves the deduplication ratio while providing acceptable read performance; based on the second, we propose an additional offline deduplication scheme that achieves deduplication ratios comparable to those of classical deduplication. We design a cold storage system called DeCold that combines these two deduplication schemes and improves deduplication efficiency. We prototype DeCold and conduct testbed experiments on real-world datasets including source code, virtual machines, and databases. Evaluations show that DeCold achieves better file access performance than the classical deduplication implementation while maintaining decent deduplication efficiency.
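To make the notion of a deduplication ratio growing across versions concrete, the following is a minimal sketch, not DeCold's implementation. It assumes fixed-size chunking, SHA-256 chunk fingerprints, and a ratio defined as logical bytes written divided by unique bytes stored; the abstract specifies none of these, and the names `CHUNK_SIZE`, `split_chunks`, and `cumulative_dedup_ratios` are hypothetical.

```python
import hashlib

# Assumption: tiny fixed-size chunks, chosen only to keep the example readable.
CHUNK_SIZE = 4


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield consecutive fixed-size chunks of data (the last one may be shorter)."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def cumulative_dedup_ratios(versions, size: int = CHUNK_SIZE):
    """Return the cumulative deduplication ratio after each version is ingested.

    Assumption: ratio = logical bytes written / unique bytes stored.
    """
    seen = set()    # fingerprints of chunks already stored
    logical = 0     # bytes written by clients so far
    physical = 0    # bytes actually stored after deduplication
    ratios = []
    for data in versions:
        for chunk in split_chunks(data, size):
            logical += len(chunk)
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                physical += len(chunk)
        # Report 1.0 while nothing has been stored, to avoid dividing by zero.
        ratios.append(logical / physical if physical else 1.0)
    return ratios


if __name__ == "__main__":
    # Three hypothetical versions of one file; later versions repeat earlier chunks.
    versions = [b"AAAABBBBCCCC", b"AAAABBBBDDDD", b"AAAABBBBDDDD"]
    print(cumulative_dedup_ratios(versions))
```

In this toy input the ratio rises as later versions repeat chunks from earlier ones. Which disk holds each unique chunk is not modeled here, and that placement is what determines whether reading a file requires spinning up inactive disks.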
