MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains, including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby. Starfish is a self-tuning system for big data analytics that includes, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. Starfish also includes a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. This demonstration will present the profiling, what-if analysis, and cost-based optimization of MapReduce programs in Starfish. We will show how (non-expert) users can employ the Starfish Visualizer to (a) get a deep understanding of a MapReduce program’s behavior during execution, (b) ask hypothetical questions on how the program’s behavior will change when parameter settings, cluster resources, or input data properties change, and (c) ultimately optimize the program.
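The interplay of profile, what-if estimate, and cost-based search described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Profile` fields, the `estimate_cost` formula, the constants, and the two tuning parameters (`num_reducers`, `sort_buffer_mb`) are hypothetical and are not Starfish's actual models or interfaces. The search is a plain exhaustive enumeration over a small grid, chosen only for brevity.

```python
from dataclasses import dataclass
from itertools import product
import math


@dataclass(frozen=True)
class Profile:
    """Hypothetical job profile: statistics a profiler might record from one run."""
    map_output_mb: float       # total intermediate data produced by all map tasks
    num_map_tasks: int         # number of map tasks in the profiled run
    reduce_cost_per_mb: float  # seconds of reduce work per MB of intermediate data


def estimate_cost(profile, num_reducers, sort_buffer_mb, reduce_slots=8):
    """Toy what-if estimate (in seconds) for one configuration.

    This formula is an illustrative assumption, not Starfish's cost model.
    """
    per_map_mb = profile.map_output_mb / profile.num_map_tasks
    # A map task spills once per full sort buffer; each extra spill adds merge time.
    spills = math.ceil(per_map_mb / sort_buffer_mb)
    map_cost = profile.num_map_tasks * 0.5 * (spills - 1)
    # Reduce tasks run in waves bounded by the available reduce slots.
    waves = math.ceil(num_reducers / reduce_slots)
    per_reducer_mb = profile.map_output_mb / num_reducers
    reduce_cost = waves * (per_reducer_mb * profile.reduce_cost_per_mb + 2.0)
    return map_cost + reduce_cost


def optimize(profile, reducer_options, buffer_options):
    """Return the (num_reducers, sort_buffer_mb) pair with the lowest estimate."""
    return min(
        product(reducer_options, buffer_options),
        key=lambda cfg: estimate_cost(profile, *cfg),
    )


# Hypothetical profile values, chosen only to exercise the sketch.
profile = Profile(map_output_mb=4096.0, num_map_tasks=64, reduce_cost_per_mb=0.05)
best = optimize(profile, reducer_options=[1, 4, 8, 16, 32], buffer_options=[32, 64, 128])
best_cost = estimate_cost(profile, *best)
print(best, round(best_cost, 1))
```

The point of the sketch is the separation of concerns: the profile is collected once, the estimator answers hypothetical questions about settings that were never run, and the optimizer only ever consults the estimator. A realistic configuration space would be far too large to enumerate, so a smarter search strategy would replace the exhaustive loop.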