ArrayBridge: Interweaving declarative array processing with high-performance computing

Author(s): Xing, Haoyuan; Floratos, Sofoklis; Blanas, Spyros; Byna, Suren; Prabhat; Wu, Kesheng; Brown, Paul

Abstract: Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in response, the database community has proposed in situ mechanisms to access data in scientific file formats. Scientists, however, desire more than a passive access method that reads arrays from files. This paper describes ArrayBridge, a bi-directional array view mechanism for scientific file formats that aims to make declarative array manipulations interoperable with imperative file-centric analyses. Our prototype implementation of ArrayBridge uses HDF5 as the underlying array storage library and seamlessly integrates into the SciDB open-source array database system. In addition to fast querying over external array objects, ArrayBridge produces arrays in the HDF5 file format just as easily as it can read from it. ArrayBridge also supports time travel queries from imperative kernels through the unmodified HDF5 API, and automatically deduplicates between array versions for space efficiency. Our extensive performance evaluation at NERSC, a large-scale scientific computing facility, shows that the performance and I/O scalability of ArrayBridge are statistically indistinguishable from those of the native SciDB storage engine.
