A Lightweight Evaluation Framework for Table Layouts in MapReduce Based Query Systems

A table layout determines how relational row-column data values are organized and stored. In recent years, a considerable number of table layouts have been developed for MapReduce-based query systems; they differ in storage space utilization, data loading time, query performance, and so on. Users are often confronted with the problem of choosing the best overall table layout for a given workload and table schema. The straightforward approach of running queries on generated data and comparing the results is time-consuming, and it is inaccurate because of MapReduce's nondeterministic execution runtime. In this paper, we propose a lightweight framework that evaluates table layouts without running the queries. The framework uses a black-box method to test critical metrics and a query-aware strategy that extracts table-layout-related operations from each query. Based on these metrics and operations, the framework makes suggestions to users. We conduct extensive experiments to empirically study the popular table layouts. The results show that column projection and compression are the two most prominent factors in general cases. Moreover, we discuss optimization opportunities for the intermediate tables produced in high-level language systems.
