ISABELA-QA: Query-driven analytics with ISABELA-compressed extreme-scale scientific data

Efficient analytics of scientific data from extreme-scale simulations is quickly becoming a top priority. Increasing simulation output sizes demand a paradigm shift in how analytics is conducted. In this paper, we argue that query-driven analytics over compressed - rather than original, full-size - data is a promising strategy for meeting the challenges of storage- and I/O-bound applications. As a proof of principle, we propose a parallel query processing engine, called ISABELA-QA, that is designed and optimized for knowledge-priors-driven analytical processing of spatio-temporal, multivariate scientific data that is initially compressed, in situ, by our ISABELA technology. With ISABELA-QA, the total data storage requirement is less than 23%-30% of the original data, which is up to eight-fold less than that of existing state-of-the-art data management technologies, which require storing both the original data and the index. Since ISABELA-QA operates on the metadata generated by our compression technology, its underlying indexing technology for efficient query processing is light-weight: it requires less than 3% of the original data, unlike existing database indexing approaches, which require 30%-300% of the original data. Moreover, ISABELA-QA is specifically optimized to retrieve the actual values, rather than spatial regions, for the variables that satisfy user-specified range queries - a functionality that is critical for high-accuracy data analytics. To the best of our knowledge, this is the first technology that enables query-driven analytics over compressed spatio-temporal floating-point data, in double or single precision, while offering a light-weight memory and disk storage footprint together with parallel, scalable, multi-node, multi-core, GPU-based query processing.
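To make the idea of a value-retrieving range query over light-weight metadata concrete, the following is a minimal Python sketch. It is not the ISABELA-QA design: the fixed-size windows, the per-window minimum and maximum, and the names `Window`, `build_windows`, and `range_query` are assumptions introduced purely for illustration, and the uncompressed `values` list stands in for a compressed payload.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Window:
    """Hypothetical fixed-size window with light-weight metadata."""
    start: int            # offset of the window's first element in the full array
    lo: float             # minimum value in the window (metadata)
    hi: float             # maximum value in the window (metadata)
    values: List[float]   # stand-in for the compressed payload (assumption)


def build_windows(data: Sequence[float], size: int) -> List[Window]:
    """Split data into windows and record min/max metadata for each one."""
    windows = []
    for start in range(0, len(data), size):
        chunk = list(data[start:start + size])
        windows.append(Window(start, min(chunk), max(chunk), chunk))
    return windows


def range_query(windows: List[Window], lo: float, hi: float) -> List[Tuple[int, float]]:
    """Return (position, value) pairs with lo <= value <= hi.

    Windows whose [lo, hi] metadata cannot overlap the query range are
    skipped without touching their payload.
    """
    hits = []
    for w in windows:
        if w.hi < lo or w.lo > hi:
            continue  # pruned using metadata only
        for i, v in enumerate(w.values):
            if lo <= v <= hi:
                hits.append((w.start + i, v))
    return hits
```

In this sketch the query returns the matching values together with their positions, rather than only the regions that contain them, which mirrors the distinction drawn in the abstract. A real engine would decompress only the windows that survive pruning.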
