POD: Performance Oriented I/O Deduplication for Primary Storage Systems in the Cloud

Recent studies have shown that moderate to high data redundancy clearly exists in primary storage systems in the Cloud. Our experimental studies reveal that data redundancy exhibits a much higher level of intensity on the I/O path than that on disks due to the relatively high temporal access locality associated with small I/O requests to redundant data. On the other hand, we also observe that directly applying data deduplication to primary storage systems in the Cloud will likely cause space contention in memory and data fragmentation on disks. Based on these observations, we propose a Performance-Oriented I/O Deduplication approach, called POD, rather than a capacity-oriented I/O deduplication approach, represented by iDedup, to improve the I/O performance of primary storage systems in the Cloud without sacrificing capacity savings of the latter. The salient feature of POD is its focus on not only the capacity-sensitive large writes and files, as in iDedup, but also the performance-sensitive while capacity-insensitive small writes and files. The experiments conducted on our lightweight prototype implementation of POD show that POD significantly outperforms iDedup in the I/O performance measure by up to 87.9% with an average of 58.8%. Moreover, our evaluation results also show that POD achieves comparable or better capacity savings than iDedup.

[1]  Jacob R. Lorch,et al.  A five-year study of file-system metadata , 2007, TOS.

[2]  Tian Luo,et al.  CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of Flash Memory based Solid State Drives , 2011, FAST.

[3]  Anne-Marie Kermarrec,et al.  Probabilistic deduplication for cluster-based storage systems , 2012, SoCC '12.

[4]  Sudipta Sengupta,et al.  Primary Data Deduplication - Large Scale Study and System Design , 2012, USENIX Annual Technical Conference.

[5]  Jim Zelenka,et al.  Informed prefetching and caching , 1995, SOSP.

[6]  Dutch T. Meyer,et al.  A study of practical deduplication , 2011, TOS.

[7]  Stephanie N. Jones Online De-duplication in a Log-Structured File System for Primary Storage , 2011 .

[8]  Mark Lillibridge,et al.  Improving restore speed for backup systems that use inline chunk-based deduplication , 2013, FAST.

[9]  Anand Sivasubramaniam,et al.  Leveraging Value Locality in Optimizing NAND Flash-based SSDs , 2011, FAST.

[10]  John C. S. Lui,et al.  Live Deduplication Storage of Virtual Machine Images in an Open-Source Cloud , 2011, Middleware.

[11]  Hong Jiang,et al.  HPDA: A hybrid parity-based disk array for enhanced performance and reliability , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[12]  Matei Ripeanu,et al.  stdchk: A Checkpoint Storage System for Desktop Grid Computing , 2007, 2008 The 28th International Conference on Distributed Computing Systems.

[13]  Philip C. Roth,et al.  Characterizing the I/O behavior of scientific applications on the Cray XT , 2007, PDSW '07.

[14]  Hong Jiang,et al.  SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput , 2011, USENIX Annual Technical Conference.

[15]  Hong Jiang,et al.  SAR: SSD Assisted Restore Optimization for Deduplication-Based Storage Systems in the Cloud , 2012, 2012 IEEE Seventh International Conference on Networking, Architecture, and Storage.

[16]  Karsten Schwan,et al.  Six degrees of scientific data: reading patterns for extreme scale science IO , 2011, HPDC '11.

[17]  Kai Li,et al.  Tradeoffs in Scalable Data Routing for Deduplication Clusters , 2011, FAST.

[18]  Ethan L. Miller,et al.  The effectiveness of deduplication on virtual machine disk images , 2009, SYSTOR '09.

[19]  Hong Jiang,et al.  IDO: Intelligent Data Outsourcing with Improved RAID Reconstruction Performance in Large-Scale Data Centers , 2012, LISA.

[20]  Alma Riska,et al.  Disk Drive Level Workload Characterization , 2006, USENIX Annual Technical Conference, General Track.

[21]  Randal C. Burns,et al.  AWOL: An Adaptive Write Optimizations Layer , 2008, FAST.

[22]  Robert Latham,et al.  Understanding and improving computational science storage access through continuous characterization , 2011, MSST.

[23]  Kai Li,et al.  Avoiding the Disk Bottleneck in the Data Domain Deduplication File System , 2008, FAST.

[24]  Sean D Dessureault,et al.  Understanding big data , 2016 .

[25]  Irfan Ahmad,et al.  Decentralized Deduplication in SAN Cluster File Systems , 2009, USENIX Annual Technical Conference.

[26]  David G. Andersen,et al.  Using vector interfaces to deliver millions of IOPS from a networked key-value storage server , 2012, SoCC '12.

[27]  Mark Lillibridge,et al.  Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality , 2009, FAST.

[28]  André Brinkmann,et al.  A study on data deduplication in HPC storage systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[29]  Jongmoo Choi,et al.  Caching less for better performance: balancing cache size and update cost of flash memory cache in hybrid storage systems , 2012, FAST.

[30]  Jie Ma,et al.  Exploiting Data Deduplication to Accelerate Live Virtual Machine Migration , 2010, 2010 IEEE International Conference on Cluster Computing.

[31]  Hong Jiang,et al.  RoLo: A Rotated Logging Storage Architecture for Enterprise Data Centers , 2010, 2010 IEEE 30th International Conference on Distributed Computing Systems.

[32]  Sean Quinlan,et al.  Venti: A New Approach to Archival Storage , 2002, FAST.

[33]  Timothy Bisson,et al.  iDedup: latency-aware, inline data deduplication for primary storage , 2012, FAST.

[34]  Raju Rangaswami,et al.  I/O Deduplication: Utilizing content similarity to improve I/O performance , 2010, TOS.

[35]  Nimrod Megiddo,et al.  ARC: A Self-Tuning, Low Overhead Replacement Cache , 2003, FAST.