A Secondary Index for Improving Reading Performance in the Inline Deduplication System

With the advent of cloud computing, huge amounts of data are stored in the cloud. To remove redundant data, deduplication technology has been proposed and has attracted wide interest from both academia and industry. In an inline deduplication system, read performance is vital, yet existing approaches become inefficient when massive volumes of data are involved. In this paper, we propose a Secondary Index Assisted Read scheme (SIAR) for inline deduplication systems. To reduce the frequency of disk accesses and improve read performance, we build a secondary index spanning both RAM and disk, exploiting the high random-read performance of RAM and the low cost of disk. We further analyze the trade-off between SIAR's performance improvement and its memory overhead, which makes the proposed scheme adaptive to different applications. Finally, we conduct extensive experiments that confirm the efficiency and efficacy of SIAR.
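The core idea above, a fast in-memory index layered over a cheaper on-disk index, can be illustrated with a minimal sketch. This is not the paper's implementation; the class and field names are hypothetical, and a Python dict stands in for the on-disk structure. The point it demonstrates is that repeated lookups of the same chunk fingerprint hit RAM and avoid further disk accesses.

```python
# Illustrative sketch only (not SIAR itself): a two-tier fingerprint
# index with a hot in-RAM layer over a full on-disk layer. A dict
# stands in for the disk index; disk_reads counts simulated disk I/O.

class TwoTierIndex:
    def __init__(self):
        self.ram = {}        # hot subset: fingerprint -> chunk location
        self.disk = {}       # stand-in for the full on-disk index
        self.disk_reads = 0  # number of simulated disk accesses

    def put(self, fingerprint, location):
        # New entries go to the on-disk index (the authoritative copy).
        self.disk[fingerprint] = location

    def lookup(self, fingerprint):
        if fingerprint in self.ram:
            return self.ram[fingerprint]   # RAM hit: no disk access
        self.disk_reads += 1               # RAM miss: one disk access
        loc = self.disk.get(fingerprint)
        if loc is not None:
            self.ram[fingerprint] = loc    # promote to RAM for reuse
        return loc

idx = TwoTierIndex()
idx.put("fp1", "container-7")
idx.lookup("fp1")          # first lookup goes to disk, then is cached
idx.lookup("fp1")          # second lookup is served from RAM
print(idx.disk_reads)      # 1
```

A real system would bound the RAM layer (e.g. with an eviction policy) and store the disk layer in a persistent structure; the sketch only shows the access-path logic that makes the RAM/disk trade-off worthwhile.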
