OFScheduler: A Dynamic Network Optimizer for MapReduce in Heterogeneous Cluster

MapReduce is a popular programming paradigm in cloud computing because it scales well when processing large data sets. However, MapReduce performs poorly in heterogeneous clusters. One reason is that Hadoop's built-in load-balancing algorithm for the Map function generates excessive network traffic. We propose OFScheduler, a new dynamic network optimizer for heterogeneous clusters that relieves network traffic during the execution of MapReduce jobs. The optimizer focuses on reducing bandwidth competition, balancing the workload across network links, and increasing bandwidth utilization. It tags different types of traffic and uses OpenFlow to adjust flow transfers dynamically. We build a simulator and an OpenFlow testbed for evaluation. The simulation results demonstrate that the proposed optimizer significantly increases bandwidth utilization and improves MapReduce performance by 24% to 63% for most jobs in a multi-path heterogeneous cluster. The experimental results show that the proposed optimizer can be deployed in a real environment.
