GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers

Today, data-intensive applications rely on geographically distributed systems to collect, store and process data. Data locality is a prominent technique for improving application performance and reducing the impact of network latency: jobs are scheduled directly on the nodes hosting the data to be processed. MapReduce and Dryad are examples of frameworks that exploit locality by splitting jobs into multiple tasks, which are dispatched to process portions of the data locally. However, as big data analysis has shifted from single clusters to geo-distributed data centers, some data must still be transferred over the network in order to reduce the schedule length. Existing scheduling techniques lack a mechanism that efficiently combines data locality with inter-data center transfer requirements for data-intensive processing across dispersed data centers. The objective of this work is therefore to formulate and solve the makespan optimization problem for data-intensive job scheduling on geo-distributed data centers. To this end, we first formulate task placement and data access as a linear program and solve it with the GLPK solver. We then present a low-complexity heuristic scheduling algorithm called GeoDis, which balances data locality against data transfer requirements to achieve a shorter makespan. Experiments with various realistic traces and synthetically generated workloads show that GeoDis can reduce the makespan of processing jobs by 44% compared to state-of-the-art algorithms and comes within 91% of the optimal solution computed by the LP solver.
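The trade-off between data locality and inter-data center transfer can be illustrated with a toy example. The sketch below is not the GeoDis algorithm or the LP formulation from the paper; it is a minimal greedy scheduler written for illustration only. The names `Task`, `makespan_locality_only`, `makespan_greedy` and the single uniform bandwidth `bw_gb_per_s` are all assumptions, as is the simplification that each data center runs one task at a time.

```python
from dataclasses import dataclass


@dataclass
class Task:
    home: int          # index of the data center that stores the input (assumed)
    size_gb: float     # input size in GB
    compute_s: float   # processing time in seconds once the data is present


def makespan_locality_only(tasks, n_dcs):
    """Run every task where its data lives; no network transfer at all."""
    ready = [0.0] * n_dcs
    for t in tasks:
        ready[t.home] += t.compute_s
    return max(ready)


def makespan_greedy(tasks, n_dcs, bw_gb_per_s):
    """Toy heuristic: place each task on the data center where it finishes
    earliest, paying a transfer delay when that is not its home site."""
    ready = [0.0] * n_dcs
    # Longest tasks first, a common list-scheduling ordering (assumption).
    for t in sorted(tasks, key=lambda task: task.compute_s, reverse=True):
        def finish(dc):
            transfer = 0.0 if dc == t.home else t.size_gb / bw_gb_per_s
            return ready[dc] + transfer + t.compute_s

        best = min(range(n_dcs), key=finish)
        ready[best] = finish(best)
    return max(ready)


# Four identical tasks whose data all sits in data center 0.
tasks = [Task(home=0, size_gb=1.0, compute_s=10.0) for _ in range(4)]
local_only = makespan_locality_only(tasks, n_dcs=2)
with_transfer = makespan_greedy(tasks, n_dcs=2, bw_gb_per_s=0.5)
print(local_only, with_transfer)
```

In this toy setting, strict locality serializes all four tasks on one site (40 s), whereas moving two of them to the idle site at a cost of 2 s of transfer each brings the makespan down to 24 s. It shows only why a scheduler may accept some data movement to shorten the schedule; the paper's evaluation relies on its own model and traces.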
