Critical Path based Performance Models for Distributed Queries

Programming models such as MapReduce and DryadLINQ provide programmers with declarative abstractions (such as SQL-like query languages) for writing data-intensive computations. These models also provide runtime systems that execute the queries on a large cluster of machines while handling the vagaries of distribution, such as messaging, failures and synchronization. However, this level of abstraction comes at a cost: the inability to understand, predict and debug performance. In this paper, we propose a performance modeling approach for predicting the execution time of distributed queries. Our approach combines the critical path method, empirically generated black-box models and cardinality estimation techniques from databases. We evaluate the models using several real-world applications and find that they predict execution time to within 10% of the actual execution time. We demonstrate the usefulness of the models in identifying performance bottlenecks, both during design and while debugging performance problems.
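The critical path method named above can be illustrated with a minimal sketch. It assumes a query plan is a directed acyclic graph of stages and that each stage already has a predicted duration, for example from a black-box model. The stage names, durations and the `critical_path` helper are hypothetical and are not taken from the paper. The predicted query time is the length of the longest dependency chain.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (predicted_total_time, stages_on_critical_path).

    durations: stage name -> predicted running time (hypothetical values)
    deps:      stage name -> list of stages that must finish first
    """
    graph = {stage: deps.get(stage, []) for stage in durations}
    finish = {}  # earliest finish time of each stage
    prev = {}    # predecessor that determines each stage's start time

    # Visit stages so that every prerequisite is processed before its dependents.
    for stage in TopologicalSorter(graph).static_order():
        # A stage starts when its slowest prerequisite finishes.
        slowest = max(graph[stage], key=lambda p: finish[p], default=None)
        start = finish[slowest] if slowest is not None else 0.0
        finish[stage] = start + durations[stage]
        prev[stage] = slowest

    # The stage that finishes last ends the critical path; walk back from it.
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    return total, path


# Hypothetical plan: two parallel reads feed a join, followed by an aggregation.
durations = {"read_a": 4.0, "read_b": 6.0, "join": 5.0, "aggregate": 2.0}
deps = {"join": ["read_a", "read_b"], "aggregate": ["join"]}
total, path = critical_path(durations, deps)
print(total, path)
```

In this sketch the slower read (`read_b`) lies on the critical path, so speeding up `read_a` alone would not shorten the predicted execution time. That is the kind of bottleneck such a model is meant to expose.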
