Spatially clustered join on heterogeneous scientific data sets

In the era of data-intensive scientific discovery, data analysis is critical for scientists to identify essential information in the mountains of data generated by large-scale simulations and experiments. A common operation in scientific data analysis is to combine information from multiple data sets, which are stored in heterogeneous file formats. This operation is typically known as a join in the database management field. Currently, a join operation involving multiple data sets in different file formats is time-consuming because of the need to prepare the data (i.e., to convert it into a uniform format or to ingest it into a database) before running the join algorithms. Furthermore, data processing languages such as SQL (Structured Query Language) cannot easily express typical scientific analysis tasks such as interpolation. In this paper, we propose three techniques to address these challenges: a two-level data model to process data from different file formats without converting them to a uniform format, a data organization structure known as Multi-Dimensional Binning (MDBin), and a join processing algorithm known as Spatially Clustered Join (SCJoin). Together, these techniques allow scientific data files to be used for query processing with lower I/O cost and faster query response time, without the extra cost of file format conversion and data ingestion. Evaluation of our proposed techniques in joining and interpolating data sets generated by a plasma physics simulation studying space weather phenomena showed up to 8X improvement over FastQuery. Querying with our solution outperforms SciDB, a popular array data management system for scientific data, by 43X-143X. We also demonstrate that our methods scale to 64K CPU cores in analyzing 32TB of data on a large-scale supercomputing system.
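
To make the general idea of joining spatially binned data concrete, the following is a minimal toy sketch and not the MDBin or SCJoin implementation, whose details are not given in this abstract. It assumes two small in-memory lists of 2-D records and a uniform grid; the names `bin_points`, `binned_join`, and `cell_size` are hypothetical and chosen only for illustration. Records from both data sets are grouped by the grid cell that contains them, and pairs are formed only within cells present in both.

```python
from collections import defaultdict
from math import floor


def bin_points(points, cell_size):
    """Group (x, y, value) records by the uniform grid cell containing them."""
    bins = defaultdict(list)
    for x, y, value in points:
        key = (floor(x / cell_size), floor(y / cell_size))
        bins[key].append((x, y, value))
    return bins


def binned_join(left, right, cell_size):
    """Pair each left record with each right record in the same grid cell.

    Illustrative assumption: two records "match" when they share a cell.
    """
    left_bins = bin_points(left, cell_size)
    right_bins = bin_points(right, cell_size)
    result = []
    # Only cells that occur in both data sets can produce join results,
    # so all other cells are skipped without being scanned.
    for key in sorted(left_bins.keys() & right_bins.keys()):
        for left_rec in left_bins[key]:
            for right_rec in right_bins[key]:
                result.append((key, left_rec[2], right_rec[2]))
    return result


# Hypothetical sample data: particle records and field records.
particles = [(0.2, 0.3, "p0"), (1.5, 0.1, "p1"), (3.7, 3.9, "p2")]
fields = [(0.9, 0.9, "f0"), (1.1, 0.8, "f1"), (2.5, 2.5, "f2")]

pairs = binned_join(particles, fields, cell_size=1.0)
print(pairs)
```

In this sketch, records that fall in cells occupied by only one data set are never compared, which is the intuition behind reducing the work of a join through spatial organization; a real system would additionally have to handle on-disk layout, parallel I/O, and matches across cell boundaries.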
