论文信息 - H-WorD: Supporting Job Scheduling in Hadoop with Workload-Driven Data Redistribution

H-WorD: Supporting Job Scheduling in Hadoop with Workload-Driven Data Redistribution

Today’s distributed data processing systems typically follow a query shipping approach and exploit data locality for reducing network traffic. In such systems the distribution of data over the cluster resources plays a significant role, and when skewed, it can harm the performance of executing applications. In this paper, we address the challenges of automatically adapting the distribution of data in a cluster to the workload imposed by the input applications. We propose a generic algorithm, named H-WorD, which, based on the estimated workload over resources, suggests alternative execution scenarios of tasks, and hence identifies required transfers of input data a priori, for timely bringing data close to the execution. We exemplify our algorithm in the context of MapReduce jobs in a Hadoop ecosystem. Finally, we evaluate our approach and demonstrate the performance gains of automatic data redistribution.

Toon Calders | Alberto Abelló | Oscar Romero | Petar Jovanovic

[1] Carlo Curino,et al. Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[2] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[3] Geoffrey C. Fox,et al. Investigation of Data Locality in MapReduce , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).

[4] Liang Dong,et al. Starfish: A Self-tuning System for Big Data Analytics , 2011, CIDR.

[5] Scott Shenker,et al. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[6] Ling Liu,et al. Purlieus: Locality-aware resource allocation for MapReduce in a cloud , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[7] Hairong Kuang,et al. The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[8] Lei Ying,et al. Map task scheduling in MapReduce with data locality: Throughput and heavy-traffic optimality , 2013, INFOCOM.

[9] Fang Dong,et al. BAR: An Efficient Data Locality Driven Task Scheduling Algorithm for Cloud Computing , 2011, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[10] R. Kolisch,et al. Heuristic algorithms for the resource-constrained project scheduling problem: classification and computational analysis , 1999 .

[11] Jacek Blazewicz,et al. Handbook on Scheduling: From Theory to Applications , 2014 .