Apply Block Index Technique to Scientific Data Analysis and I/O Systems

Scientific discoveries increasingly rely on the analysis of massive amounts of data. The ability to directly access the most relevant data records through queries, without sifting through all of them, becomes essential. However, scientific datasets are commonly stored on parallel file systems and I/O systems that are optimized for reading and writing large chunks of data, and many scientific datasets exhibit spatial-temporal data similarity, such that records with similar values are often located in close proximity to each other. Therefore, our previous work started to investigate the benefit of using a block range index technique for scientific datasets, which records only the value range of all the records in a data block. In this paper, we extend our work in several aspects. First, we implement our block index technique and integrate it with the ADIOS I/O system. Second, we show that our proposed method can be significantly better than the existing minmax and bitmap indexing methods supported in ADIOS, and can have comparable performance in the worst case. Third, we propose several techniques that take advantage of the block index information to greatly reduce the time needed to retrieve data for query results. Fourth, we evaluate our approach using several real scientific datasets and analyze their spatial-temporal data similarity characteristics. Through our study, we believe the block index can be an effective indexing technique for scientific datasets, with little implementation and operating overhead. Its size is small enough for building the indexes on the fly, yet its query information is sufficient for efficient data access.
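The block range index idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation and does not use the ADIOS API; the function names (`build_block_index`, `candidate_blocks`, `range_query`), the fixed block size, and the in-memory list standing in for a dataset are all assumptions made for illustration. The sketch stores only a (min, max) pair per block and uses it to skip blocks that cannot contain matching records.

```python
from typing import List, Tuple


def build_block_index(values: List[float], block_size: int) -> List[Tuple[float, float]]:
    """Return one (min, max) pair per fixed-size block of `values`."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((min(block), max(block)))
    return index


def candidate_blocks(index: List[Tuple[float, float]], lo: float, hi: float) -> List[int]:
    """Return ids of blocks whose value range overlaps the query range [lo, hi]."""
    return [b for b, (bmin, bmax) in enumerate(index) if bmax >= lo and bmin <= hi]


def range_query(values: List[float], index: List[Tuple[float, float]],
                block_size: int, lo: float, hi: float) -> List[int]:
    """Return positions of records with lo <= value <= hi, reading only candidate blocks."""
    hits = []
    for b in candidate_blocks(index, lo, hi):
        start = b * block_size
        # Only blocks that passed the range check are scanned record by record.
        for offset, v in enumerate(values[start:start + block_size]):
            if lo <= v <= hi:
                hits.append(start + offset)
    return hits


# Hypothetical data with spatial similarity: nearby records hold similar values.
data = [1.0, 1.2, 1.1, 1.3, 5.0, 5.2, 5.1, 5.4, 9.0, 9.1, 8.9, 9.3]
index = build_block_index(data, block_size=4)
print(candidate_blocks(index, 5.1, 9.0))      # block 0 is skipped entirely
print(range_query(data, index, 4, 5.1, 9.0))
```

In this toy dataset the index holds three pairs for twelve records, and a query for values in [5.1, 9.0] never reads the first block. The more strongly similar values cluster together, the more blocks such a range check can exclude, which is the property the abstract attributes to many scientific datasets.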
