SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks

Storage system benchmarks either use samples of proprietary data or synthesize artificial data in simple ways (such as filling with zeros or random bytes). However, many storage systems behave completely differently on such artificial data than on real-world data. This is the case for systems that apply data reduction techniques such as compression and deduplication. To address this problem, we propose a benchmarking methodology called mimicking and apply it to the domain of data compression. Our methodology is based on characterizing the properties of real data that influence compressor performance. We then use these characterizations to generate new synthetic data that mimics the real data in many aspects of compression. Unlike current solutions, which only reproduce the compression ratio of the data, mimicking is flexible enough to also emulate compression times and data heterogeneity, and we show that these properties matter to system performance. In our implementation, called SDGen, characterizations take at most 2.5 KB per data chunk (e.g., 64 KB) and can be used to share benchmarking data efficiently and in a highly anonymized fashion, raising few or no privacy concerns. We evaluated our data generator's accuracy on compressibility and compression times using real-world datasets and multiple compressors (lz4, zlib, bzip2, and lzma). As a proof of concept, we integrated SDGen as a content generation layer into two popular benchmarks (LinkBench and Impressions).
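To make the mimicking idea concrete, the following is a minimal sketch, not SDGen's actual algorithm: it characterizes a real chunk by its compression ratio under one compressor (zlib is assumed here as the target), then generates a synthetic chunk of the same size whose ratio approximates the original by mixing incompressible random bytes with a highly compressible run. SDGen's real characterizations are richer (they also capture compression time and heterogeneity) and compressor-independent; the function names below are illustrative.

```python
import os
import zlib


def characterize(chunk: bytes) -> dict:
    """Build a tiny per-chunk signature: size and zlib compression ratio.

    The signature contains no chunk contents, so it can be shared
    without exposing the original data.
    """
    compressed = zlib.compress(chunk)
    return {"size": len(chunk), "ratio": len(chunk) / len(compressed)}


def generate(sig: dict) -> bytes:
    """Emit a synthetic chunk whose zlib ratio approximates sig['ratio'].

    Heuristic: random bytes are essentially incompressible, while a run
    of a single byte compresses to almost nothing. Sizing the random
    prefix to roughly size/ratio bytes makes the compressed output about
    that long, yielding the desired ratio. This is a coarse sketch; a
    real generator would also match compression time and mixed content.
    """
    size, target = sig["size"], sig["ratio"]
    n_random = min(size, int(size / target))
    return os.urandom(n_random) + b"\x00" * (size - n_random)
```

A benchmark would run `characterize` once per chunk of a private dataset, ship only the signatures, and call `generate` at load time to fill the store with realistic-compressibility content instead of zeros or pure noise.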
