ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results

Large-scale data analysis is an important activity in many organizations that typically requires the deployment of data-intensive workflows. As data is processed these workflows generate large intermediate results, which are typically pipelined from one operator to the following. However, if materialized, these results become reusable, hence, subsequent workflows need not recompute them. There are already many solutions that materialize intermediate results but all of them assume a fixed data format. A fixed format, however, may not be the optimal one for every situation. For example, it is well-known that different data fragmentation strategies (e.g., horizontal and vertical) behave better or worse according to the access patterns of the subsequent operations. In this paper, we present ResilientStore, which assists on selecting the most appropriate data format for materializing intermediate results. Given a workflow and a set of materialization points, it uses rule-based heuristics to choose the best storage data format based on subsequent access patterns. We have implemented ResilientStore for HDFS and three different data formats: SequenceFile, Parquet and Avro. Experimental results show that our solution gives 18 % better performance than any solution based on a single fixed format.

[1]  Jorge-Arnulfo Quiané-Ruiz,et al.  WWHow! Freeing Data Storage from Cages , 2013, CIDR.

[2]  Ashraf Aboulnaga,et al.  ReStore: Reusing Results of MapReduce Jobs , 2012, Proc. VLDB Endow..

[3]  Jorge-Arnulfo Quiané-Ruiz,et al.  Trojan data layouts: right shoes for a running elephant , 2011, SoCC.

[4]  Alberto Abelló,et al.  Building cubes with MapReduce , 2011, DOLAP '11.

[5]  Jignesh M. Patel,et al.  Column-Oriented Storage Techniques for MapReduce , 2011, Proc. VLDB Endow..

[6]  Wolfgang Lehner,et al.  SAP HANA database: data management for modern business applications , 2012, SGMD.

[7]  Zhiwei Xu,et al.  RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[8]  Alberto Abelló,et al.  Incremental Consolidation of Data-Intensive Multi-Flows , 2016, IEEE Transactions on Knowledge and Data Engineering.

[9]  David J. DeWitt,et al.  Split query processing in polybase , 2013, SIGMOD '13.

[10]  GhemawatSanjay,et al.  The Google file system , 2003 .

[11]  Anastasia Ailamaki,et al.  H2O: a hands-free adaptive store , 2014, SIGMOD Conference.

[12]  Vladimir Vlassov,et al.  m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data , 2013, 2013 IEEE 16th International Conference on Computational Science and Engineering.

[13]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[14]  Norbert Ritter,et al.  Towards Automated Polyglot Persistence , 2015, BTW.

[15]  Michael Stonebraker,et al.  A Demonstration of the BigDAWG Polystore System , 2015, Proc. VLDB Endow..

[16]  Yanpei Chen,et al.  Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads , 2012, Proc. VLDB Endow..

[17]  Ryan Johnson,et al.  Here are my Data Files. Here are my Queries. Where are my Results? , 2011, CIDR.

[18]  Sam Lightstone,et al.  DB2 with BLU Acceleration: So Much More than Just a Column Store , 2013, Proc. VLDB Endow..