A Cost-based Storage Format Selector for Materialization in Big Data Frameworks

Modern big data frameworks (such as Hadoop and Spark) allow multiple users to perform large-scale analysis simultaneously. Typically, users deploy Data-Intensive Workflows (DIWs) for their analytical tasks. The DIWs of different users share many common parts (i.e., 50-80%), which can be materialized for reuse in future executions. Materialization improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on Distributed File Systems (DFS) using a fixed data format. However, a fixed choice might not be optimal for every situation. For example, it is well known that different data fragmentation strategies (i.e., horizontal, vertical or hybrid) behave better or worse depending on the access patterns of the subsequent operations. In this paper, we present a cost-based approach that helps decide the most appropriate storage format in every situation. We first present a generic cost-based storage format selector framework that considers the three fragmentation strategies. Then, we use our framework to instantiate cost models for specific Hadoop data formats (namely SequenceFile, Avro and Parquet), and test it with realistic use cases. Our solution gives on average a 33% speedup over SequenceFile, an 11% speedup over Avro, and a 32% speedup over Parquet; overall, it provides up to a 25% performance gain.
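The idea of choosing a layout by estimated cost can be illustrated with a minimal sketch. This is not the paper's cost model: the `Workload` and `Layout` types, the `write_overhead` values, and the cost formula are all hypothetical, chosen only to show how the access pattern of subsequent operations (full scans versus projections of a few columns) can change which fragmentation strategy is cheapest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a materialized result and its future reads."""
    rows: int
    cols: int
    bytes_per_cell: int
    scans: int           # expected reads that touch every column
    projections: int     # expected reads that touch only some columns
    projected_cols: int  # columns touched by each projection


@dataclass(frozen=True)
class Layout:
    """Hypothetical layout; write_overhead is an assumed multiplier, not a measurement."""
    name: str
    write_overhead: float
    skips_unused_columns: bool


def estimated_cost(layout: Layout, w: Workload) -> float:
    """Toy cost in bytes moved: one write plus all expected reads."""
    total = w.rows * w.cols * w.bytes_per_cell
    write = total * layout.write_overhead
    scan = w.scans * total
    # Only a layout that can skip columns benefits from narrow projections.
    fraction = w.projected_cols / w.cols if layout.skips_unused_columns else 1.0
    projection = w.projections * total * fraction
    return write + scan + projection


def select_layout(layouts: list[Layout], w: Workload) -> Layout:
    """Pick the layout with the lowest estimated cost for this workload."""
    return min(layouts, key=lambda layout: estimated_cost(layout, w))


# Assumed candidates: a horizontal layout that is cheap to write, and a
# vertical layout that costs more to write but can read columns selectively.
LAYOUTS = [
    Layout("horizontal", write_overhead=1.0, skips_unused_columns=False),
    Layout("vertical", write_overhead=1.5, skips_unused_columns=True),
]

scan_heavy = Workload(rows=1000, cols=10, bytes_per_cell=8,
                      scans=5, projections=0, projected_cols=1)
projection_heavy = Workload(rows=1000, cols=10, bytes_per_cell=8,
                            scans=0, projections=5, projected_cols=1)

print(select_layout(LAYOUTS, scan_heavy).name)
print(select_layout(LAYOUTS, projection_heavy).name)
```

Under these assumed numbers, the scan-heavy workload favors the horizontal layout, because the extra write cost of the vertical layout is never recovered, while the projection-heavy workload favors the vertical layout. A realistic selector would replace the toy formula with cost models calibrated for each concrete format.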
