Cluster computing environments such as MapReduce give users little feedback on job progress. Accurate, time-oriented progress indicators would be valuable in this domain, where job execution times vary widely with the amount of data being processed, the amount of parallelism available, and the types of operators (often user-defined) that perform the processing. Such feedback would help users make informed decisions, such as whether to terminate a job and restart it later, when the cluster has more resources available. However, none of the techniques used by existing tools or available in the literature provides a non-trivial progress indicator for queries running in a distributed environment. In this paper, we apply recently developed techniques for estimating the progress of single-site SQL queries to parallel environments. In particular, we target environments where queries consist of pipelines of MapReduce jobs. We also present techniques that improve the accuracy and usefulness of progress estimators operating in this environment. We implemented our estimators in the Pig system and demonstrate their performance in experiments with real data (search logs) on a real cluster.