Improving Multisite Workflow Performance Using Model-Based Scheduling

Workflows play an important role in expressing and executing scientific applications. In recent years, a variety of computational sites and resources have emerged, and users often have access to multiple, geographically distributed resources. These sites are heterogeneous, and the performance of a given workflow task varies from one site to another. Additionally, users typically have a limited resource allocation at each site. In such cases, a judicious scheduling strategy is required to map workflow tasks to resources so that the workload is balanced among sites and data-transfer overhead is minimized. Most existing systems run the entire workflow at a single site, use naive approaches to distribute tasks across sites, or leave it to the user to optimize the allocation of tasks to distributed resources. This results in a significant loss of productivity for scientists. In this paper, we propose a multisite workflow scheduling technique that uses performance models to predict execution time on different resources and dynamic probes to identify the achievable network throughput between sites. We evaluate our approach on real-world applications in a distributed environment using the Swift distributed execution framework and show that it improves execution time by up to 60% compared to the default schedule.
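
To make the idea concrete, the following is a minimal illustrative sketch of how predicted execution time and probed throughput could be combined when choosing a site for each task. It is not the scheduling algorithm of this paper: the `Site` fields, the linear runtime model (`work / speed`), the transfer estimate (`input_mb / throughput`), and the greedy longest-task-first policy are all assumptions made for illustration, and each site is simplified to run one task at a time.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    speed: float        # hypothetical model parameter: work units per second
    throughput: float   # hypothetical probe result: MB/s achievable to this site
    busy_until: float = 0.0  # time at which the site becomes free


def predicted_runtime(work, site):
    # Stand-in for a performance model; assumed linear in the amount of work.
    return work / site.speed


def transfer_time(input_mb, site):
    # Stand-in for a dynamic probe; assumes throughput stays constant.
    return input_mb / site.throughput


def schedule(tasks, sites):
    """Greedily map tasks to sites.

    tasks: list of (task_id, work, input_mb) tuples.
    Returns a dict mapping task_id to the chosen site name.
    """
    plan = {}
    # Place the largest tasks first so small ones can fill remaining gaps.
    for task_id, work, input_mb in sorted(tasks, key=lambda t: -t[1]):
        def finish_time(site):
            return (site.busy_until
                    + transfer_time(input_mb, site)
                    + predicted_runtime(work, site))

        best = min(sites, key=finish_time)
        best.busy_until = finish_time(best)
        plan[task_id] = best.name
    return plan
```

In this sketch a site with faster processors can still lose to a slower one when its network path is poor, which is the kind of trade-off a scheduler that considers both compute and transfer cost has to weigh.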
