HadoopToSQL: a MapReduce query optimizer

MapReduce is a cost-effective way to achieve scalable performance for many log-processing workloads, which typically process their entire dataset. MapReduce can be inefficient, however, for business-oriented workloads, especially when these workloads access only a subset of the data. HadoopToSQL seeks to improve MapReduce performance for this latter class of workloads by transforming MapReduce queries to use the indexing, aggregation, and grouping features provided by SQL databases. HadoopToSQL statically analyzes the computation performed by the MapReduce queries, using symbolic execution to derive preconditions and postconditions for the map and reduce functions. It then uses this information either to generate input restrictions, which avoid scanning the entire dataset, or to generate equivalent SQL queries, which take advantage of SQL grouping and aggregation features. We evaluate MapReduce queries optimized by HadoopToSQL in both single-node and cluster experiments. In these experiments, HadoopToSQL always improves performance over MapReduce and approaches that of hand-written SQL.
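As a rough illustration of the kind of equivalence described above, the following sketch compares a small MapReduce-style job with a hand-written SQL query that produces the same result. It is not HadoopToSQL's analysis or output: the SQL is written by hand rather than derived by symbolic execution, and the `orders` table, its columns, the sample records, and the helper names `run_mapreduce` and `run_sql` are all hypothetical. The filter in the map function plays the role of an input restriction (the `WHERE` clause), and the summing reduce corresponds to `GROUP BY` with `SUM`.

```python
import sqlite3
from collections import defaultdict

# Hypothetical order records: (customer, region, amount).
ORDERS = [
    ("alice", "EU", 120),
    ("bob", "US", 80),
    ("alice", "EU", 30),
    ("carol", "EU", 50),
    ("bob", "EU", 10),
]


def map_fn(record):
    """Emit (customer, amount) for EU orders only; other records are dropped."""
    customer, region, amount = record
    if region == "EU":
        yield customer, amount


def reduce_fn(key, values):
    """Sum the amounts emitted for one customer."""
    return key, sum(values)


def run_mapreduce(records):
    """Naive MapReduce: scans every record, then groups by key and reduces."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())


def run_sql(records):
    """Hand-written SQL counterpart of the job above.

    WHERE mirrors the filter in map_fn (an input restriction that an index
    on region could serve); GROUP BY and SUM mirror reduce_fn.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT, region TEXT, amount INTEGER)"
    )
    conn.execute("CREATE INDEX idx_orders_region ON orders (region)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "WHERE region = 'EU' GROUP BY customer"
    ).fetchall()
    conn.close()
    return dict(rows)


mr_result = run_mapreduce(ORDERS)
sql_result = run_sql(ORDERS)
```

Both functions return the same per-customer totals. The MapReduce version has to read every record to apply its filter, whereas the SQL version leaves the database free to use the index on `region` and its built-in grouping and aggregation.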
