A What-if Engine for Cost-based MapReduce Optimization

The Starfish project at Duke University aims to provide MapReduce users and applications with good performance automatically, without requiring them to understand and manipulate the numerous tuning knobs in a MapReduce system. This paper describes the What-if Engine, an indispensable component of Starfish that serves a purpose similar to that of the costing engine used by the query optimizer in a database system. We discuss the problem and challenges that the What-if Engine addresses, the techniques it uses, and the design decisions that led us to these techniques.
