HCompress: Hierarchical Data Compression for Multi-Tiered Storage Environments

Modern scientific applications read and write massive amounts of data through simulations, observations, and analysis, and they spend the majority of their runtime performing I/O. HPC storage solutions include fast node-local and shared storage resources to alleviate this bottleneck. Moreover, several middleware libraries (e.g., Hermes) have been proposed to move data between these tiers transparently. Data reduction is another technique: it reduces the amount of data produced and hence improves I/O performance. These two technologies, if used together, can benefit from each other. The effectiveness of data compression can be enhanced by selecting different compression algorithms according to the characteristics of each tier, and the multi-tiered hierarchy can benefit from the extra capacity that compression frees up. In this paper, we design and implement HCompress, a hierarchical data compression library that improves application performance by harmoniously leveraging both multi-tiered storage and data compression. We have developed a novel compression selection algorithm that facilitates the optimal matching of compression libraries to the tiered storage. Our evaluation shows that HCompress can improve the performance of scientific applications by 7x when compared to other state-of-the-art tiered storage solutions.
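The idea of matching a compression algorithm to a storage tier can be illustrated with a small sketch. The code below is not the HCompress selection algorithm, which the abstract does not specify; it is a simplified cost model written for illustration. It assumes that the cost of a write is the compression time plus the compressed size divided by the tier's bandwidth, and that a tier is usable only if the compressed data fits in its remaining capacity. The names `Tier`, `CODECS`, `profile`, and `select` are hypothetical, and the codecs come from the Python standard library.

```python
import bz2
import lzma
import time
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Hypothetical description of one storage tier."""
    name: str
    bandwidth_mb_s: float  # assumed sustained write bandwidth, in MB/s
    capacity_bytes: int    # assumed remaining capacity, in bytes


# Candidate codecs (assumption: standard-library codecs stand in for
# whatever compression libraries a real system would consider).
CODECS = {
    "none": lambda data: data,
    "zlib": lambda data: zlib.compress(data, 1),
    "bz2": lambda data: bz2.compress(data),
    "lzma": lambda data: lzma.compress(data),
}


def profile(sample):
    """Return {codec: (seconds, compressed_size)} measured on a sample."""
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        out = compress(sample)
        results[name] = (time.perf_counter() - start, len(out))
    return results


def select(profiles, tiers):
    """Pick the (cost, tier, codec) triple with the lowest estimated cost.

    Cost model (an assumption of this sketch):
        cost = compression_seconds + compressed_bytes / tier_bandwidth
    Pairs whose compressed size exceeds the tier capacity are skipped.
    Returns None if nothing fits.
    """
    best = None
    for tier in tiers:
        bytes_per_second = tier.bandwidth_mb_s * 1e6
        for codec, (seconds, size) in profiles.items():
            if size > tier.capacity_bytes:
                continue
            cost = seconds + size / bytes_per_second
            if best is None or cost < best[0]:
                best = (cost, tier.name, codec)
    return best
```

Under this model, a fast but small tier tends to favor a light codec that makes the data fit, while a slow tier tends to favor a heavier codec, because the bytes saved outweigh the extra CPU time. Calling `select(profile(sample), tiers)` on a representative sample of the application's data would return one such pairing.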
