Storage-aware Algorithms for Scheduling of Workflow Ensembles in Clouds

This paper addresses the problem of scheduling ensembles of data-intensive workflows under cost and deadline constraints in Infrastructure as a Service (IaaS) clouds. Previous research in this area ignores file transfers between workflow tasks, which, as we show, often have a large impact on workflow ensemble execution. We propose and implement a simulation model for handling file transfers between tasks; it calculates bandwidth dynamically and supports a configurable number of replicas, which allows us to simulate various levels of congestion. The resulting model can represent a wide range of storage systems available on clouds: from in-memory caches (such as memcached), to distributed file systems (such as NFS servers), to cloud storage (such as Amazon S3 or Google Cloud Storage). We observe that for some applications up to 90% of the execution time is spent on file transfers. Next, we propose and evaluate a novel scheduling algorithm that minimizes the number of transfers by taking advantage of data caching and file locality. We find that for data-intensive applications it performs better than other scheduling algorithms. Additionally, we modify the original scheduling algorithms so that they operate effectively in environments where file transfers take non-zero time.
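The two ideas in the abstract, congestion-dependent transfer time and locality-aware placement, can be illustrated with a minimal sketch. It is not the paper's simulation model or scheduling algorithm: the function names, the even spreading of transfers across replicas, the fair bandwidth split within a replica, and the per-VM cache sets are all assumptions made for illustration.

```python
from typing import Dict, Set


def transfer_time(file_size_mb: float, replica_bandwidth_mb_s: float,
                  active_transfers: int, num_replicas: int) -> float:
    """Estimate seconds needed to move one file under congestion.

    Assumption (hypothetical model): concurrent transfers are spread evenly
    over the replicas, and each replica splits its bandwidth fairly among
    the transfers it serves.
    """
    if active_transfers < 1 or num_replicas < 1:
        raise ValueError("need at least one transfer and one replica")
    # Ceiling division: the most loaded replica determines the estimate.
    transfers_per_replica = -(-active_transfers // num_replicas)
    effective_bandwidth = replica_bandwidth_mb_s / transfers_per_replica
    return file_size_mb / effective_bandwidth


def pick_vm_by_locality(task_inputs: Dict[str, float],
                        vm_caches: Dict[str, Set[str]]) -> str:
    """Pick the VM that would have to download the fewest megabytes.

    task_inputs maps input file name -> size in MB; vm_caches maps
    VM id -> names of files already cached on that VM (both hypothetical).
    """
    def missing_mb(vm: str) -> float:
        return sum(size for name, size in task_inputs.items()
                   if name not in vm_caches[vm])

    # Sorting the VM ids first makes tie-breaking deterministic.
    return min(sorted(vm_caches), key=missing_mb)
```

In this sketch, adding replicas lowers the number of transfers each replica serves and so shortens the estimated transfer time, while placing a task on the VM that already caches its largest inputs avoids transfers altogether. A dynamic model would recompute the estimate whenever a transfer starts or finishes; the function above only gives a static snapshot.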
