Storage-Tag-Aware Scheduler for Hadoop Cluster

Big data analytics has reduced the complexity of processing extremely large data sets through ecosystems such as Hadoop, MapR, and Cloudera. Apache Hadoop is an open-source ecosystem that manages large data sets in a distributed environment. MapReduce is a programming model that processes massive amounts of unstructured data over a Hadoop cluster. Recently, Hadoop extended its homogeneous storage function to heterogeneous storage, so that data sets can be stored across multiple storage media, i.e., SSD, RAM, and DISK. This development improves the performance of the data block placement strategy and allows a client to store large data sets on multiple storage media more efficiently than with homogeneous storage. However, it also increases the consumption of computing capacity and memory during MapReduce job scheduling. The scheduler runs each MapReduce job in a homogeneous container configured with CPU, memory, DISK volume, and network I/O, yet it accesses, processes, and stores data sets on heterogeneous storage media. This mismatch introduces processing latency when locating and pairing the source data set with MapReduce tasks, and results in abnormally high consumption of computing capacity and memory in a Datanode. Similarly, when the scheduler assigns MapReduce jobs to multiple Datanodes, the same processing latency can severely affect the performance of the whole cluster. In this paper, we present the Storage-Tag-Aware Scheduler (STAS), which reduces processing latency by scheduling MapReduce jobs into heterogeneous storage containers, i.e., SSD, DISK, and RAM containers. STAS tags each job with its storage medium, such as $Job_{SSD}$, $Job_{DISK}$, and $Job_{RAM}$, and places the tagged jobs into heterogeneous shared queues, which assign a processing configuration to the enlisted jobs. The STAS manager then schedules the shared-queue jobs into heterogeneous MapReduce containers and writes the output to the storage media of the cluster.
The experimental evaluation shows that STAS uses computing capacity and memory more efficiently than the available schedulers in a Hadoop cluster.
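
The tag, shared-queue, and container flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `StorageTag`, `Job`, `SharedQueues`, `QUEUE_CONFIG`, and `schedule` are hypothetical, the per-queue configuration values are invented for illustration, and containers are modeled only as a count of free slots per storage medium.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class StorageTag(Enum):
    """Storage media a job can be tagged with (Job_SSD, Job_DISK, Job_RAM)."""
    SSD = "SSD"
    DISK = "DISK"
    RAM = "RAM"


# Hypothetical per-queue processing configurations; the numbers are
# illustrative assumptions, not values taken from the paper.
QUEUE_CONFIG = {
    StorageTag.SSD: {"vcores": 2, "memory_mb": 2048},
    StorageTag.DISK: {"vcores": 1, "memory_mb": 1024},
    StorageTag.RAM: {"vcores": 2, "memory_mb": 4096},
}


@dataclass
class Job:
    name: str
    tag: StorageTag
    config: dict = field(default_factory=dict)  # filled in when enlisted


class SharedQueues:
    """One FIFO queue per storage tag."""

    def __init__(self):
        self.queues = {tag: deque() for tag in StorageTag}

    def enlist(self, job):
        # The queue assigns its processing configuration to the enlisted job.
        job.config = dict(QUEUE_CONFIG[job.tag])
        self.queues[job.tag].append(job)


def schedule(shared, free_containers):
    """Place queued jobs into free containers of the matching storage type.

    free_containers maps a StorageTag to the number of free containers.
    Returns a list of (job name, container type) placements; jobs without
    a free matching container stay in their queue.
    """
    placements = []
    for tag, queue in shared.queues.items():
        free = free_containers.get(tag, 0)
        while queue and free > 0:
            job = queue.popleft()
            placements.append((job.name, tag.value))
            free -= 1
    return placements
```

In this sketch a job is never placed in a container of a different storage type; whether STAS falls back to another container type when the matching one is busy is not stated in the abstract, so no fallback is modeled.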
