MOMC: Multi-objective and Multi-constrained Scheduling Algorithm of Many Tasks in Hadoop

Even though scheduling in distributed systems has been studied for many years, platforms and job types change every day. This is why we need specialized algorithms driven by the requirements of new applications, especially when an application is deployed in a Cloud environment. One of the most important frameworks for large-scale data processing in Clouds is Hadoop, together with its extensions. The Hadoop framework ships with default schedulers such as FIFO, Fair Scheduler, Capacity Scheduler, and Hadoop on Demand. Each of these scheduling algorithms focuses on a single, different constraint. It is hard to satisfy multiple constraints and pursue several objectives at the same time. After summarizing the most common schedulers and showing the need each one addressed when it was introduced, this paper presents MOMC, a multi-objective and multi-constrained scheduling algorithm of many tasks in Hadoop. The MOMC implementation focuses on two objectives, avoiding resource contention and keeping the cluster workload optimal, and on two constraints, deadline and budget. To compare the algorithms on different metrics, we use the Scheduler Load Simulator, which is integrated into the Hadoop framework and reduces the time developers spend on testing. As a killer application that generates many tasks, we chose a processing task over the Million Song Dataset, a data set containing metadata for one million commercially available songs.
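The abstract does not spell out the MOMC algorithm itself. Purely as an illustration of how two constraints (deadline, budget) and two objectives (no contention, balanced workload) could interact, the following minimal sketch filters candidate nodes by the constraints and then picks the least-loaded one. Every name here (`Node`, `Task`, `pick_node`), the cost model (runtime times price), and the load measure are assumptions made for this example, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    speed: float        # work units processed per second (assumed model)
    price: float        # cost per second of use (assumed model)
    running_tasks: int  # tasks currently placed on the node
    slots: int          # maximum concurrent tasks the node can host


@dataclass
class Task:
    work: float      # work units required
    deadline: float  # seconds allowed for completion
    budget: float    # maximum cost allowed


def pick_node(task: Task, nodes: List[Node]) -> Optional[Node]:
    """Return the feasible node with the lowest load after placement, or None."""
    best, best_load = None, None
    for node in nodes:
        # Objective 1 (assumed reading): avoid contention by never oversubscribing.
        if node.running_tasks >= node.slots:
            continue
        runtime = task.work / node.speed
        cost = runtime * node.price
        # Constraints: the task must fit both its deadline and its budget.
        if runtime > task.deadline or cost > task.budget:
            continue
        # Objective 2 (assumed reading): prefer the node that stays least loaded.
        load = (node.running_tasks + 1) / node.slots
        if best_load is None or load < best_load:
            best, best_load = node, load
    return best
```

With a hypothetical cluster of a fast but expensive node, a slow but cheap node, and a fully occupied node, a task whose budget rules out the fast node would be placed on the cheap one, provided the cheap node still meets the deadline; if no node satisfies both constraints, the function returns `None`.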
