Data Reduction Analysis for Climate Data Sets

Global climate modeling not only demands substantial computing capability but also poses hard challenges for data storage systems. Input and output data sets typically require hundreds or even thousands of terabytes of storage. Storage reduction methods, such as content deduplication and data compression, are therefore important for reducing the storage requirements of climate modeling. However, little work has investigated how effective these data reduction methods are on climate data sets. In this paper, we study the potential benefit of data reduction for climate data by examining 46.5 TB of climate data in total, comprising 3 observation data sets (14.1 TB) and 3 climate model output data sets (32.4 TB). We apply five data compression algorithms and two types of content deduplication mechanisms to these data sets to measure the achievable data reduction. Furthermore, we examine the compressibility of data from different climate components. Our work demonstrates the potential of applying data reduction methods in climate modeling platforms and provides guidance for selecting suitable methods for different kinds of climate data sets. We find that the compression method LCFP provides the best compression ratio; however, its throughput, especially its inflate (decompression) throughput, is much lower than that of all the other methods. To strike a better balance between compression ratio and throughput, we propose a new compression method for the model output data. The new method achieves a comparable compression ratio while attaining about 20 times higher inflate throughput than LCFP.
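
The quantities discussed above (compression ratio, deflate and inflate throughput, and deduplication benefit) can be illustrated with a minimal Python sketch. It is not the method evaluated in the paper: zlib stands in for the compression algorithms, which may or may not include it, and fixed-size chunking with SHA-1 fingerprints is assumed as one common form of content deduplication. The function names `compression_stats` and `dedup_ratio`, the 4096-byte chunk size, and the definition of ratio as original size divided by reduced size are all assumptions made for illustration.

```python
import hashlib
import time
import zlib


def compression_stats(data: bytes, level: int = 6) -> dict:
    """Compress and decompress `data` with zlib; report ratio and throughputs.

    Ratio is assumed to mean original size / compressed size.
    Throughputs are in MiB of uncompressed data per second.
    """
    t0 = time.perf_counter()
    packed = zlib.compress(data, level)
    t1 = time.perf_counter()
    restored = zlib.decompress(packed)
    t2 = time.perf_counter()
    if restored != data:
        raise ValueError("lossless round trip failed")
    mib = len(data) / 2**20
    return {
        "ratio": len(data) / len(packed),
        # Guard against a zero time delta on very small inputs.
        "deflate_mib_s": mib / max(t1 - t0, 1e-9),
        "inflate_mib_s": mib / max(t2 - t1, 1e-9),
    }


def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Fixed-size chunk deduplication: original size / size of unique chunks."""
    if not data:
        return 1.0
    unique = {}  # chunk fingerprint -> chunk length
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique.setdefault(hashlib.sha1(chunk).digest(), len(chunk))
    return len(data) / sum(unique.values())


if __name__ == "__main__":
    # Synthetic sample: three identical zero-filled chunks plus one distinct chunk.
    sample = bytes(4096) * 3 + b"\x01" * 4096
    print(compression_stats(sample))
    print(dedup_ratio(sample))
```

Timing a single call on a small buffer is noisy; a more careful measurement would repeat each step over large files and report an average.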
