Multi-objective scheduling of MapReduce jobs in big data processing

Data generation has increased drastically over the past few years due to the rapid development of Internet-based technologies, a period that has been called the big data era. Big data offers a paradigm shift in data exploration and utilization. The MapReduce computational paradigm is a well-known framework and is considered the main enabler for the distributed and scalable processing of large amounts of data. Despite recent efforts to improve the performance of MapReduce, scheduling MapReduce jobs across multiple nodes is still regarded as a multi-objective optimization problem. The problem becomes increasingly complex when virtualized clusters in cloud computing are used to execute a large number of tasks. This study aims to optimize MapReduce job scheduling with respect to completion time and the cost of cloud service models. First, the problem is formulated as a multi-objective model with two objective functions: (i) completion time minimization and (ii) cost minimization. Second, a scheduling algorithm based on earliest finish time scheduling, which considers both resource allocation and job scheduling in the cloud, is proposed. Lastly, experimental results show that the proposed scheduler performs better than other well-known schedulers, such as FIFO and Fair.
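
To make the idea of trading completion time against cost concrete, the following is a minimal illustrative sketch and not the scheduler proposed in the study. It assumes a greedy earliest-finish-time style assignment in which each task goes to the virtual machine with the lowest weighted sum of finish time and monetary cost. The names `VM`, `Task`, `schedule`, and the `weight` parameter are hypothetical, as are the speed and price figures; the weighted sum is only one simple way to combine two objectives, and it mixes seconds with currency units, so in practice both terms would need to be normalized.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    speed: float           # work units processed per second (hypothetical)
    price: float           # cost per second of use (hypothetical)
    ready_at: float = 0.0  # time at which the VM becomes free


@dataclass
class Task:
    name: str
    work: float  # amount of work in the same units as VM.speed


def schedule(tasks, vms, weight=0.5):
    """Greedily assign each task to the VM with the lowest weighted score.

    weight=1.0 considers only finish time (pure earliest finish time);
    weight=0.0 considers only cost. This scalarization is an assumption
    made for illustration, not the model from the paper.
    """
    plan = []
    total_cost = 0.0
    for task in tasks:
        best = None
        for vm in vms:
            runtime = task.work / vm.speed
            finish = vm.ready_at + runtime
            cost = runtime * vm.price
            score = weight * finish + (1.0 - weight) * cost
            if best is None or score < best[0]:
                best = (score, vm, finish, cost)
        _, chosen, finish, cost = best
        chosen.ready_at = finish  # the VM is busy until this task ends
        total_cost += cost
        plan.append((task.name, chosen.name, finish))
    makespan = max(vm.ready_at for vm in vms)
    return plan, makespan, total_cost


if __name__ == "__main__":
    tasks = [Task("map-1", 10), Task("map-2", 4)]
    vms = [VM("fast", speed=2.0, price=3.0), VM("slow", speed=1.0, price=1.0)]
    plan, makespan, total_cost = schedule(tasks, vms, weight=1.0)
    print(plan, makespan, total_cost)
```

Varying `weight` between 0 and 1 moves the resulting plan between the cheapest and the fastest assignment, which is the tension a two-objective formulation has to resolve.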
