Design of FastQuery: How to Generalize Indexing and Querying System for Scientific Data

Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies such as FastBit are critical for facilitating interactive exploration of large datasets. These technologies accelerate query processing by adding auxiliary information (indices) to existing datasets. Using these indices requires two things: matching the relational data model used by the indexing systems with the array data model used by most scientific data, and providing an efficient input and output layer for reading and writing the indices. In this work, we present a flexible design that can be applied to most scientific data formats. We demonstrate this flexibility by applying it to two of the most commonly used scientific data formats, HDF5 and NetCDF. We present two case studies using simulation data from the particle accelerator and climate simulation communities. To demonstrate the effectiveness of the new design, we also present a detailed performance study using both synthetic and real scientific workloads.
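
To make the model-matching problem above concrete, here is a minimal sketch of one way an array variable can be treated as a relational column, indexed with binned bitmaps, and queried. It is an illustration under stated assumptions, not the FastQuery or FastBit API: the names `row_to_coords`, `build_bitmap_index`, and `query_at_least` are hypothetical, the array is assumed to be stored in row-major order, and the bitmaps are plain uncompressed Python integers.

```python
# Hypothetical illustration only; none of these names come from FastQuery or FastBit.

def row_to_coords(row, shape):
    """Map a relational row number back to array coordinates (row-major order)."""
    coords = []
    for extent in reversed(shape):
        row, idx = divmod(row, extent)
        coords.append(idx)
    return tuple(reversed(coords))


def build_bitmap_index(column, bin_edges):
    """Build one bitmap per bin [edges[i], edges[i+1]); bit r is set if row r falls in the bin.

    Values outside the outermost edges are not indexed in this sketch.
    """
    bitmaps = [0] * (len(bin_edges) - 1)
    for row, value in enumerate(column):
        for b in range(len(bitmaps)):
            if bin_edges[b] <= value < bin_edges[b + 1]:
                bitmaps[b] |= 1 << row
                break
    return bitmaps


def query_at_least(column, bitmaps, bin_edges, threshold):
    """Return the rows with value >= threshold, reading raw values only for the boundary bin."""
    hits = 0
    for b, bits in enumerate(bitmaps):
        lo, hi = bin_edges[b], bin_edges[b + 1]
        if lo >= threshold:
            hits |= bits  # every row in this bin qualifies
        elif hi > threshold:
            # Boundary bin: the bitmap only yields candidates, so check raw values.
            row = 0
            while bits:
                if bits & 1 and column[row] >= threshold:
                    hits |= 1 << row
                bits >>= 1
                row += 1
    return [r for r in range(len(column)) if (hits >> r) & 1]


# A 2 x 3 array variable, flattened so that each element becomes one row of a column.
shape = (2, 3)
temperature = [[1.0, 5.5, 9.0],
               [3.2, 7.1, 0.4]]
column = [v for array_row in temperature for v in array_row]

bin_edges = [0.0, 2.5, 5.0, 7.5, 10.0]
bitmaps = build_bitmap_index(column, bin_edges)
rows = query_at_least(column, bitmaps, bin_edges, 5.5)
coords = [row_to_coords(r, shape) for r in rows]
```

In this sketch the query result is a list of row numbers, which is what a relational index naturally produces; `row_to_coords` is the step that turns those rows back into array positions a scientist can use. A production system would also compress the bitmaps and store them next to the data in the same file format, which this example leaves out.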
