Scientific Workflow Scheduling with Provenance Data in a Multisite Cloud

Several Scientific Workflow Management Systems (SWfMSs) with provenance support (e.g., Chiron) have recently been deployed in the cloud. However, they typically use a single cloud site. In this paper, we consider a multisite cloud, where data and computing resources are distributed across different sites (possibly in different regions). Based on a multisite SWfMS architecture, i.e., multisite Chiron, and its provenance model, we propose a multisite task scheduling algorithm that takes into account the time needed to generate provenance data. We performed an extensive experimental evaluation of our algorithm using the Microsoft Azure multisite cloud and two real-life scientific workflows (Buzz and Montage). The results show that our scheduling algorithm is up to 49.6% better than baseline algorithms in terms of total execution time.
