Improving I/O Performance with Adaptive Data Compression for Big Data Applications

Increasingly large-scale simulations are generating an unprecedented amount of data. However, the widening gap between computation and I/O capacity on high-end computing machines creates a severe bottleneck for data analysis. As a solution, in-situ analytics processes output data while the simulation is running, before the data is placed on disk. Data movement between simulation and analytics, however, adds overhead to in-situ analytics at scale. This paper addresses the following question: can compression reduce the data movement cost and improve the performance of in-situ analytics for petascale applications? In particular, we explore when, where, and how to use compression techniques to reduce the cost of data movement between simulation and analytics. To find the best algorithm and the best place to compress data in a given situation, we introduce an adaptive data compression algorithm. An adaptive compression service is developed and analyzed for in-situ analytics middleware. Experimental results demonstrate that the compression service increases data transfer bandwidth and improves the application's end-to-end transfer performance.
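The abstract does not spell out the adaptive algorithm, so the following is only a minimal sketch of one way such a selection could work, not the authors' method. It assumes a simple cost model in which the end-to-end cost of a block is its compression time plus its transfer time at a known bandwidth, and it ignores decompression cost and the question of where compression runs. The names `CODECS`, `total_time`, and `choose_codec` are hypothetical, and the candidate codecs are simply those available in the Python standard library.

```python
import bz2
import lzma
import time
import zlib

# Hypothetical candidate set (an assumption, not taken from the paper).
# Each entry maps a codec name to a function that compresses a bytes object.
CODECS = {
    "zlib": lambda data: zlib.compress(data, 1),
    "bz2": lambda data: bz2.compress(data, 1),
    "lzma": lambda data: lzma.compress(data, preset=0),
}


def total_time(out_size, compress_seconds, bandwidth):
    """Assumed cost model: time to compress plus time to move the output.

    out_size is in bytes, bandwidth in bytes per second.
    """
    return compress_seconds + out_size / bandwidth


def choose_codec(sample, bandwidth):
    """Pick the codec with the lowest estimated cost for this sample.

    Returns "none" when sending the sample uncompressed is estimated
    to be at least as fast as every candidate codec.
    """
    best_name = "none"
    best_cost = total_time(len(sample), 0.0, bandwidth)
    for name, compress in CODECS.items():
        start = time.perf_counter()
        compressed = compress(sample)
        elapsed = time.perf_counter() - start
        cost = total_time(len(compressed), elapsed, bandwidth)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name


if __name__ == "__main__":
    block = b"temperature=300.0;" * 4096  # highly repetitive sample data
    for bw in (1e6, 1e9, 1e12):  # bytes per second
        print(f"bandwidth {bw:.0e} B/s -> {choose_codec(block, bw)}")
```

Under this model, a slow link favors compressing because the bytes saved outweigh the CPU time spent, while a very fast link favors sending raw data. A real service would also need to account for decompression on the analytics side and for the CPU contention that compression causes on the simulation nodes.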
