Indexing Blocks to Reduce Space and Time Requirements for Searching Large Data Files

Scientific discoveries increasingly rely on the analysis of massive amounts of data generated from scientific experiments, observations, and simulations. The ability to directly access the most relevant data records, without sifting through all of them, becomes essential. While many indexing techniques have been developed to quickly locate the selected data records, the time and space required for building and storing these indexes are often too expensive to meet the demands of in situ or real-time data analysis. Existing indexing methods generally capture information about each individual data record; however, when reading a data record, the I/O system typically has to access a whole block or page of data. In this work, we postulate that indexing blocks instead of individual data records could significantly reduce index size and index building time without increasing the I/O time for accessing the selected data records. Our experiments using multiple real datasets on a supercomputer show that a block index can reduce query time by a factor of 2 to 50 over other existing methods, including SciDB and FastQuery. Moreover, the size of the block index is almost negligible compared to the data size, and the index can be built at a rate that reaches the peak I/O speed.
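To make the idea of indexing blocks rather than records concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say what is stored per block, so the sketch assumes a simple (min, max) summary for each block; the names `build_block_index` and `query_range`, the block size, and the sample data are all hypothetical.

```python
def build_block_index(values, block_size):
    """Return one (min, max) summary per block of consecutive records.

    Assumption: a block is summarized by its value range only.
    """
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((min(block), max(block)))
    return index


def query_range(values, index, block_size, lo, hi):
    """Return positions of records with lo <= value <= hi.

    Only blocks whose (min, max) range overlaps [lo, hi] are read;
    all other blocks are skipped without touching their records.
    """
    hits = []
    blocks_read = 0
    for b, (bmin, bmax) in enumerate(index):
        if bmax < lo or bmin > hi:
            continue  # the block cannot contain a match
        blocks_read += 1
        start = b * block_size
        for offset, v in enumerate(values[start:start + block_size]):
            if lo <= v <= hi:
                hits.append(start + offset)
    return hits, blocks_read


# Hypothetical data: 12 records stored in blocks of 4.
data = [3, 5, 4, 6, 40, 42, 41, 39, 7, 8, 9, 6]
idx = build_block_index(data, 4)
hits, blocks_read = query_range(data, idx, 4, 39, 41)
print(idx, hits, blocks_read)
```

In this sketch the index holds one entry per block instead of one per record, which is why its size stays small relative to the data, and building it requires only a single sequential pass over the values. The query still reads whole blocks, matching the observation that the I/O system fetches a block or page at a time anyway.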
