Job-Aware Scheduling for Big Data Processing

Most big data jobs are network-bound: they involve large amounts of data transfer among the nodes in a cluster. Optimizing the scheduling of flows can therefore improve big data job performance. Traditional techniques mostly schedule individual flows without considering the correlations among them. In this paper, we take the dependencies among flows into account and propose traffic forecasting and job-aware priority scheduling for big data processing. First, we forecast the network traffic for the flows of each job through run-time monitoring, assign a unique priority to each job, and tag every packet of that job with the priority. Then we schedule flows of the same priority (often the same job) in FIFO order. We implement the proposed scheme in the NS-2 simulator and show that it can increase network utilization and reduce job completion time.
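
The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Flow`, `assign_job_priorities`, and `schedule` are hypothetical, the forecast volume is simply given as an input rather than obtained by run-time monitoring, and the rule that jobs with less forecast traffic receive higher priority is an assumption, since the abstract does not state how priorities are derived from the forecast.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Flow:
    job_id: str
    flow_id: str
    forecast_bytes: int  # hypothetical: volume predicted by run-time monitoring
    priority: int = -1   # tag copied from the owning job


def assign_job_priorities(flows):
    """Give each job a unique priority (0 = highest) and tag its flows.

    Assumption: jobs with less total forecast traffic get higher priority.
    """
    totals = {}
    for f in flows:
        totals[f.job_id] = totals.get(f.job_id, 0) + f.forecast_bytes
    # Sort by (total bytes, job id) so ties are broken deterministically.
    ranked = sorted(totals, key=lambda j: (totals[j], j))
    priority_of = {job: rank for rank, job in enumerate(ranked)}
    for f in flows:
        f.priority = priority_of[f.job_id]
    return priority_of


def schedule(flows):
    """Return flows in service order: by priority, FIFO within a priority."""
    queues = {}
    for f in flows:  # list order stands in for arrival order
        queues.setdefault(f.priority, deque()).append(f)
    order = []
    for prio in sorted(queues):
        while queues[prio]:
            order.append(queues[prio].popleft())
    return order


flows = [
    Flow("A", "a1", 500),
    Flow("B", "b1", 100),
    Flow("A", "a2", 300),
    Flow("B", "b2", 50),
]
priorities = assign_job_priorities(flows)
service_order = [f.flow_id for f in schedule(flows)]
print(priorities, service_order)
```

In this example job B (150 bytes forecast in total) is ranked ahead of job A (800 bytes), so B's flows are served first, and the flows within each job keep their arrival order.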
