SAR: SSD Assisted Restore Optimization for Deduplication-Based Storage Systems in the Cloud

The explosive growth of digital content places enormous strain on storage systems in the cloud. Data deduplication has proven highly effective at shortening the backup window and saving network bandwidth and storage space in cloud backup, archiving, and primary storage systems such as VM platforms. However, the delay and power consumption of restore operations from deduplicated storage can be significantly higher than from storage without deduplication. The main reason is that, after deduplication, a file or block is split into many small data chunks that are often scattered across non-sequential locations on HDDs, so a subsequent read operation can trigger many HDD I/O requests and multiple disk seeks. To address this problem, we propose SAR, an SSD-Assisted Restore scheme that exploits the high random-read performance and low power consumption of SSDs, together with the data-sharing characteristic of deduplication-based storage systems, by storing in SSDs the unique data chunks that have high reference counts, small sizes, and non-sequential layouts. In this way, many critical random-read requests to HDDs are redirected to SSDs, significantly improving system performance and energy efficiency. Extensive trace-driven and VM-restore evaluations of a SAR prototype show that SAR significantly outperforms traditional deduplication-based schemes in both restore performance and energy efficiency.
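
The following Python sketch illustrates the chunk-placement idea described above: deciding which unique chunks to keep on SSD based on the three selection criteria (high reference count, small size, non-sequential on-disk layout) and routing restore-time reads accordingly. The field names, thresholds, and helper functions here are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of the SSD chunk-placement heuristic, assuming hypothetical
# field names and thresholds; the paper's actual policy and parameters may differ.

from dataclasses import dataclass

@dataclass
class Chunk:
    fingerprint: str      # content hash identifying the unique chunk
    size: int             # chunk size in bytes
    ref_count: int        # number of files/blocks referencing this chunk
    is_sequential: bool   # True if stored contiguously with its neighbors on HDD

# Illustrative thresholds (assumptions, not values from the paper).
REF_COUNT_THRESHOLD = 4        # "high reference count"
SMALL_CHUNK_BYTES = 16 * 1024  # "small size"

def place_on_ssd(chunk: Chunk) -> bool:
    """Return True if the chunk should be stored/cached on SSD.

    Chunks that are highly shared, small, and scattered on the HDD are the
    ones whose reads turn into expensive random HDD seeks, so serving them
    from SSD removes the most seeks per byte of SSD capacity used.
    """
    return (chunk.ref_count >= REF_COUNT_THRESHOLD
            and chunk.size <= SMALL_CHUNK_BYTES
            and not chunk.is_sequential)

def restore_read(chunk: Chunk, ssd_index: set) -> str:
    """Route a restore-time read to SSD when the chunk resides there."""
    return "SSD" if chunk.fingerprint in ssd_index else "HDD"
```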
