Parallel index and query for large scale data analysis

Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but significant challenges remain in designing a system that can process general scientific datasets. Such a system needs to run on distributed multi-core platforms, efficiently utilize the underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to the processing of a massive 50TB dataset generated by a large-scale accelerator modeling code and demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.
