The impact of data locality on the performance of a SaaS cloud with real-time data-intensive applications

As cloud computing continues to gain momentum, big data analytics is increasingly offered as Software as a Service (SaaS). Beyond the heterogeneity and multi-tenancy of the underlying virtualized environment, scheduling such real-time, data-intensive, embarrassingly parallel applications in a SaaS cloud involves another serious challenge: data locality. Data-aware scheduling policies are therefore needed to exploit data locality effectively, while also taking into account the other attributes of the workload and the characteristics of the resources. To this end, we investigate via simulation the impact of data locality on the performance of a SaaS cloud in which real-time, data-intensive bags-of-tasks are scheduled dynamically under various data availability conditions. A non-data-aware baseline scheduling policy is compared with two proposed data-aware heuristics, in order to shed light on the effect of data locality awareness on system performance.
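The abstract does not specify the baseline policy or the two heuristics, so the following Python sketch only illustrates the general contrast between a non-data-aware and a data-aware placement decision. Everything in it is an assumption made for illustration: the `VM` and `Task` classes, the fixed `BANDWIDTH` value, the greedy earliest-estimated-finish rule, and the toy workload are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

BANDWIDTH = 100.0  # MB/s; assumed uniform network bandwidth (hypothetical)


@dataclass
class VM:
    name: str
    speed: float                               # work units per second
    cached: set = field(default_factory=set)   # dataset ids stored locally
    free_at: float = 0.0                       # time at which the VM becomes idle


@dataclass
class Task:
    work: float       # computational demand in work units
    dataset: str      # id of the input dataset
    data_size: float  # input size in MB


def transfer_time(task, vm):
    """Time to fetch the task's input; zero if the data is already local."""
    return 0.0 if task.dataset in vm.cached else task.data_size / BANDWIDTH


def schedule(tasks, vms, data_aware):
    """Greedily place each task on the VM with the smallest estimated finish time.

    The non-data-aware variant ignores transfer time in its estimate, but the
    transfer cost is always paid when the task actually runs.
    """
    finish_times = []
    for task in tasks:
        def estimate(vm):
            t = vm.free_at + task.work / vm.speed
            if data_aware:
                t += transfer_time(task, vm)
            return t

        vm = min(vms, key=estimate)  # ties go to the first VM in the list
        actual = vm.free_at + transfer_time(task, vm) + task.work / vm.speed
        vm.free_at = actual
        vm.cached.add(task.dataset)  # fetched data stays on the VM afterwards
        finish_times.append(actual)
    return finish_times


def make_vms():
    """Two identical VMs, each holding a different dataset (toy setup)."""
    return [VM("a", 1.0, {"d1"}), VM("b", 1.0, {"d2"})]


# A bag of two independent tasks whose inputs live on different VMs.
bag = [Task(10.0, "d2", 1000.0), Task(10.0, "d1", 1000.0)]

baseline_finish = schedule(bag, make_vms(), data_aware=False)
aware_finish = schedule(bag, make_vms(), data_aware=True)
```

In this toy setup the non-data-aware variant sends each task to a VM that lacks its input, so both tasks finish at time 20, whereas the data-aware variant finishes both at time 10. With a deadline of 15 per task, the first variant would miss both deadlines and the second would meet both. The numbers say nothing about the results reported in the paper.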
