Optimizing the Query Performance of Block Index Through Data Analysis and I/O Modeling

Indexing techniques have become an efficient tool that lets scientists directly access the most relevant data records. However, building and storing indexes with traditional approaches, such as R-trees and bitmaps, is expensive in both time and space. We recently began addressing this issue with the idea of a "block index", and our previous work showed promising results when comparing it against other well-known solutions, including ADIOS, SciDB, and FastBit. In this work, we further improve the technique from both theoretical and implementation perspectives. Driven by an extensive effort in characterizing scientific datasets and modeling I/O systems, we present a theoretical model for analyzing the query performance of the block index with respect to a given block size configuration. We also introduce three optimization techniques that achieve a 2.3x query time reduction compared to the original implementation.
