Henosis: workload-driven small array consolidation and placement for HDF5 applications on heterogeneous data stores

Scientific data analysis pipelines face scalability bottlenecks when processing massive datasets that consist of millions of small files. Such datasets commonly arise in domains as diverse as detecting supernovae and post-processing computational fluid dynamics simulations. Furthermore, applications often use inference frameworks such as TensorFlow and PyTorch whose naive I/O methods exacerbate I/O bottlenecks. One solution is to use scientific file formats, such as HDF5 and FITS, to organize small arrays in one big file. However, storing everything in one file does not fully leverage the heterogeneous data storage capabilities of modern clusters. This paper presents Henosis, a system that intercepts data accesses inside the HDF5 library and transparently redirects I/O to the in-memory Redis object store or the disk-based TileDB array store. During this process, Henosis consolidates small arrays into bigger chunks and intelligently places them in data stores. A critical research aspect of Henosis is that it formulates object consolidation and data placement as a single optimization problem. Henosis carefully constructs a graph to capture the I/O activity of a workload and produces an initial solution to the optimization problem using graph partitioning. Henosis then refines the solution using a hill-climbing algorithm which migrates arrays between data stores to minimize I/O cost. The evaluation on two real scientific data analysis pipelines shows that consolidation with Henosis makes I/O 300× faster than directly reading small arrays from TileDB and 3.5× faster than workload-oblivious consolidation methods. Moreover, jointly optimizing consolidation and placement in Henosis makes I/O 1.7× faster than strategies that perform consolidation and placement independently.
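The pipeline described in the abstract (build a graph of I/O activity, partition it to consolidate arrays, then refine placement by hill climbing) can be illustrated with a toy sketch. This is not the Henosis implementation: the cost parameters in `STORES`, the greedy edge merging used as a stand-in for graph partitioning, the `memory_capacity` constraint, and all function names are assumptions made for illustration. The sketch also runs consolidation and placement one after the other, whereas Henosis optimizes them jointly.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cost model (not from the paper): (latency per read, cost per size unit).
STORES = {"memory": (1.0, 0.1), "disk": (10.0, 1.0)}

def co_access_graph(workload):
    """Edge weight = number of requests that read both arrays."""
    graph = defaultdict(int)
    for request in workload:
        for a, b in combinations(sorted(request), 2):
            graph[(a, b)] += 1
    return graph

def consolidate(arrays, graph, max_chunk):
    """Greedy stand-in for graph partitioning: merge along the heaviest edges first."""
    chunk_of = {a: frozenset([a]) for a in arrays}
    for (a, b), _ in sorted(graph.items(), key=lambda kv: (-kv[1], kv[0])):
        ca, cb = chunk_of[a], chunk_of[b]
        if ca != cb and len(ca | cb) <= max_chunk:
            merged = ca | cb
            for x in merged:
                chunk_of[x] = merged
    return sorted(set(chunk_of.values()), key=sorted)

def io_cost(workload, chunks, placement, sizes):
    """Each request reads every chunk it touches in full from that chunk's store."""
    total = 0.0
    for request in workload:
        for chunk in chunks:
            if chunk & request:
                latency, per_unit = STORES[placement[chunk]]
                total += latency + per_unit * sum(sizes[a] for a in chunk)
    return total

def hill_climb(workload, chunks, sizes, memory_capacity):
    """Move one chunk at a time between stores while the total I/O cost decreases."""
    placement = {c: "disk" for c in chunks}
    best = io_cost(workload, chunks, placement, sizes)
    improved = True
    while improved:
        improved = False
        for chunk in chunks:
            old = placement[chunk]
            placement[chunk] = "memory" if old == "disk" else "disk"
            used = sum(sizes[a] for c in chunks if placement[c] == "memory" for a in c)
            cost = io_cost(workload, chunks, placement, sizes)
            if used <= memory_capacity and cost < best:
                best, improved = cost, True   # keep the move
            else:
                placement[chunk] = old        # undo the move
    return placement, best

# Toy workload: arrays "a" and "b" are read together three times, "c" and "d" once.
sizes = {"a": 1, "b": 1, "c": 1, "d": 1}
workload = [{"a", "b"}, {"a", "b"}, {"c", "d"}, {"a", "b"}]
chunks = consolidate(sizes, co_access_graph(workload), max_chunk=2)
placement, cost = hill_climb(workload, chunks, sizes, memory_capacity=2)
```

In this toy workload the frequently co-accessed pair is consolidated into one chunk and migrated to the in-memory store, while the rarely read pair stays on disk because the assumed memory capacity is exhausted.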
