MapReduce scheduling algorithms: a review

Recent trends in big data show that the amount of data continues to grow at an exponential rate. This growth has inspired many researchers over the past few years to explore new research directions across multiple areas of big data. The widespread popularity of big data processing platforms built on the MapReduce framework has created a growing demand to further optimize their performance for various purposes. In particular, improving resource and job scheduling is becoming critical, since scheduling fundamentally determines whether applications can achieve their performance goals in different use cases. Scheduling plays an important role in big data, mainly in reducing the execution time and cost of processing. This paper surveys the research undertaken in the field of scheduling in big data platforms. Moreover, it analyzes scheduling in MapReduce from two aspects: taxonomy and performance evaluation. The research progress in MapReduce scheduling algorithms is also discussed. The limitations of existing MapReduce scheduling algorithms and future research opportunities are pointed out for easy identification by researchers. Our study can serve as a benchmark for expert researchers proposing novel MapReduce scheduling algorithms, and as a starting point for novice researchers.
