Deadline-Oriented Task Scheduling for MapReduce Environments

To provide timely results for Big Data analytics, it is crucial to satisfy deadline requirements for MapReduce jobs in production environments. In this paper, we propose a deadline-oriented task scheduling approach, named Dart, that meets a given deadline and, when only part of the dataset can be processed before the time limit, maximizes the amount of input processed. Dart uses an iterative estimation method, based on both historical data and the job's running status, to precisely estimate the job completion time in real time. By comparing the estimated time with the deadline constraint, a YARN-based task scheduler dynamically decides whether to continue or terminate the map phase. We have validated our approach using workloads from OpenCloud and Facebook on a cluster of 60 virtual machines. The results show that Dart not only meets the deadline effectively but also processes near-maximal data volumes, even when the deadline is extremely tight and limited resources are allocated.
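The continue-or-terminate decision described above can be illustrated with a minimal Python sketch. It is not Dart's actual estimator, which the abstract does not specify: the blending rule, the wave-based cost model, and every name here (`JobStatus`, `estimate_completion_s`, `should_continue_map_phase`, `hist_map_s`, `hist_reduce_s`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobStatus:
    """Snapshot of a running job (all fields are hypothetical)."""
    elapsed_s: float       # wall-clock seconds since the job started
    maps_done: int         # map tasks finished so far
    maps_pending: int      # map tasks not yet started
    observed_map_s: float  # mean duration of the finished map tasks
    slots: int             # map tasks that can run concurrently


def estimate_completion_s(status, hist_map_s, hist_reduce_s, extra_maps=0):
    """Estimate total job time if `extra_maps` more map tasks are run.

    Assumed model: blend the historical mean map duration with the mean
    observed in this run, trusting the observation more as tasks finish.
    """
    w = status.maps_done / (status.maps_done + 1)  # 0 with no observations
    map_s = w * status.observed_map_s + (1 - w) * hist_map_s
    waves = -(-extra_maps // status.slots)  # ceiling division
    return status.elapsed_s + waves * map_s + hist_reduce_s


def should_continue_map_phase(status, hist_map_s, hist_reduce_s, deadline_s):
    """Continue only if one more wave of maps still fits before the deadline."""
    if status.maps_pending == 0:
        return False
    next_wave = min(status.slots, status.maps_pending)
    est = estimate_completion_s(status, hist_map_s, hist_reduce_s, next_wave)
    return est <= deadline_s
```

A scheduler following this sketch would re-run `should_continue_map_phase` each time a wave of map tasks completes, so the estimate is refined iteratively as more running-status data arrives, and would move on to the reduce phase as soon as the function returns `False`.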
