DupHunter: Flexible High-Performance Deduplication for Docker Registries

The rise of containers has led to a broad proliferation of container images. The associated storage performance and capacity requirements place high pressure on the infrastructure of container registries, which store and serve images. Exploiting the high file redundancy in real-world container images is a promising way to drastically reduce the storage requirements of these growing registries. However, existing deduplication techniques significantly degrade registry performance because of the high overhead of restoring layers. We propose DupHunter, a new Docker registry architecture that not only natively deduplicates layers to save space but also reduces layer restore overhead. DupHunter supports several configurable deduplication modes, which provide different levels of storage efficiency, durability, and performance to suit a range of uses. To mitigate the negative impact of deduplication on image download times, DupHunter introduces a two-tier storage hierarchy with a novel layer prefetch/preconstruct cache algorithm based on user access patterns. Under real workloads, in the highest data reduction mode, DupHunter reduces storage space by up to 6.9× compared to current implementations. In the highest performance mode, DupHunter reduces GET layer latency by up to 2.8× compared to the state of the art.
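To make the trade-off between space savings and layer restore overhead concrete, the following is a minimal Python sketch of file-level layer deduplication. It is not DupHunter's implementation: the names `DedupStore`, `put_layer`, `get_layer`, `make_layer`, and `read_files` are hypothetical, layers are modeled as plain in-memory tar archives, and the sketch ignores directories, links, file metadata, compression, and the requirement that a restored layer match its original digest.

```python
import hashlib
import io
import tarfile


class DedupStore:
    """Toy content-addressed store: each unique file body is kept once."""

    def __init__(self):
        self.blobs = {}    # SHA-256 hex digest -> file bytes
        self.recipes = {}  # layer id -> list of (path, digest)

    def put_layer(self, layer_id, tar_bytes):
        """Unpack a layer tarball and store only file bodies not seen before."""
        recipe = []
        with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue  # assumption: regular files only
                data = tar.extractfile(member).read()
                digest = hashlib.sha256(data).hexdigest()
                self.blobs.setdefault(digest, data)
                recipe.append((member.name, digest))
        self.recipes[layer_id] = recipe

    def get_layer(self, layer_id):
        """Restore a layer by rebuilding a tarball from its recipe.

        This rebuild step is the restore overhead paid on every GET.
        """
        out = io.BytesIO()
        with tarfile.open(fileobj=out, mode="w") as tar:
            for path, digest in self.recipes[layer_id]:
                data = self.blobs[digest]
                info = tarfile.TarInfo(name=path)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        return out.getvalue()

    def stored_bytes(self):
        """Total size of unique file bodies held by the store."""
        return sum(len(body) for body in self.blobs.values())


def make_layer(files):
    """Build an in-memory tarball from a {path: bytes} mapping."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


def read_files(tar_bytes):
    """Return the regular files of a tarball as a {path: bytes} mapping."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        return {m.name: tar.extractfile(m).read()
                for m in tar.getmembers() if m.isfile()}
```

In this sketch, two layers that share a file store its body once, while every `get_layer` call has to reassemble the archive from the recipe. A cache of already reconstructed layers, such as the prefetch/preconstruct cache described above, would avoid that rebuild for layers that are likely to be requested.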
