SPO: A Secure and Performance-aware Optimization for MapReduce Scheduling

Abstract MapReduce is a common framework that effectively processes multi-petabyte data in a distributed manner. It is therefore widely used in heterogeneous environments, such as the cloud, to provide performance adequate for system needs. Despite these benefits, tuning the system configuration to achieve maximum performance remains challenging and requires deep expertise. In addition, several MapReduce security issues have recently been raised and have not yet been well addressed. In this paper, we present a performance-aware and secure framework, named SPO, that minimizes the makespan of tasks while considering task security constraints. First, inspired by the HEFT algorithm, SPO proposes a two-stage static scheduler, applied to the Map and Reduce phases respectively, that minimizes makespan while considering network traffic. Second, SPO* introduces a mathematical optimization model of the proposed scheduler that estimates system performance under security constraints with an error of less than 2%. The experimental results demonstrate that SPO outperforms stock Hadoop in terms of makespan and network traffic by 29% and 31%, respectively, for tasks running in heterogeneous environments.
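
As a rough illustration of the HEFT-style idea the abstract refers to, the sketch below greedily places independent tasks on heterogeneous nodes by earliest finish time. It is not the SPO scheduler itself: the names Task, Node and eft_schedule, the per-node cost table, the flat transfer_time penalty for non-local input, and the integer security levels are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Task:
    name: str
    cost: Dict[str, float]   # node name -> execution time on that node
    local_nodes: Set[str]    # nodes that already hold the task's input
    transfer_time: float     # assumed flat cost to fetch input on a non-local node
    min_security: int = 0    # hypothetical minimum security level required


@dataclass
class Node:
    name: str
    security: int = 0        # hypothetical security level offered by the node


def eft_schedule(tasks: List[Task], nodes: List[Node]) -> Tuple[Dict[str, Tuple[str, float]], float]:
    """Greedy earliest-finish-time placement for independent tasks.

    Returns (placement, makespan), where placement maps a task name to
    (node name, finish time).
    """
    ready = {n.name: 0.0 for n in nodes}  # time at which each node becomes free
    placement: Dict[str, Tuple[str, float]] = {}

    # HEFT-like priority: tasks with a larger average cost are placed first.
    order = sorted(tasks, key=lambda t: sum(t.cost.values()) / len(t.cost), reverse=True)

    for task in order:
        best = None
        for node in nodes:
            # Skip nodes that violate the (assumed) security constraint
            # or for which no execution cost is known.
            if node.security < task.min_security or node.name not in task.cost:
                continue
            fetch = 0.0 if node.name in task.local_nodes else task.transfer_time
            finish = ready[node.name] + fetch + task.cost[node.name]
            if best is None or finish < best[0]:
                best = (finish, node.name)
        if best is None:
            raise ValueError(f"no eligible node for task {task.name}")
        finish, chosen = best
        ready[chosen] = finish
        placement[task.name] = (chosen, finish)

    makespan = max(ready.values(), default=0.0)
    return placement, makespan
```

Charging a transfer penalty only on non-local nodes is one simple way to make the finish-time estimate sensitive to network traffic; a real scheduler would model bandwidth and the shuffle between Map and Reduce stages in more detail.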
