A simulation toolkit to investigate the effects of grid characteristics on workflow completion time

Advances in technology and the growing number and scale of compute resources have enabled larger computational science experiments and given researchers many choices of where and how to store data and perform computation. Analyzing the time to completion of their experiments helps scientists make the best use of both human and computational resources, but a comprehensive analysis is difficult because it involves experiment, system, and user variables, as well as their interactions with each system configuration. We present a simulation toolkit for analyzing computational science experiments and estimating their time to completion. Our approach uses a minimal description of the experiment's workflow, kept separate from the information about the systems being evaluated. We evaluate our approach using synthetic experiments that reflect actual workflow patterns, executed on systems from the NSF TeraGrid. Our evaluation focuses on ranking the available systems in order of expected experiment completion time. We show that, with sufficient system information, the model can help investigate alternative systems and evaluate workflow bottlenecks. We also discuss the challenges posed by volatile queue wait time behavior and suggest methods to improve the accuracy of simulation for near-term workflow executions. In particular, we evaluate the impact of advance notice of predictable spikes in queue wait time caused by down-time and reservations. We show that, given advance notice, the probability of a correct ranking for a sample of synthetic workflows could increase from 59% to 74% or even 79%.
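The ranking idea described above can be illustrated with a minimal sketch. It is not the toolkit itself: the `Stage` and `System` classes, their fields, and the additive cost model (input transfer, plus queue wait, plus run time for each stage, with stages run in sequence) are all assumptions made for illustration. The only point it carries over from the abstract is that the workflow description is kept separate from the system information, and that systems are ranked by estimated completion time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """Hypothetical minimal description of one workflow stage."""
    name: str
    core_hours: float   # total work, in core-hours on a reference core
    cores: int          # cores requested for the stage
    input_gb: float     # data staged in before the stage runs


@dataclass
class System:
    """Hypothetical system information, kept separate from the workflow."""
    name: str
    queue_wait_h: float        # assumed expected queue wait per job, in hours
    speed: float               # per-core speed relative to the reference core
    bandwidth_gb_per_h: float  # assumed sustained transfer rate into the system


def completion_time_h(stages: List[Stage], system: System) -> float:
    """Estimate completion time for sequential stages on a single system."""
    total = 0.0
    for stage in stages:
        transfer = stage.input_gb / system.bandwidth_gb_per_h
        run = stage.core_hours / (stage.cores * system.speed)
        total += transfer + system.queue_wait_h + run
    return total


def rank_systems(stages: List[Stage], systems: List[System]) -> List[System]:
    """Return the systems ordered by estimated completion time, fastest first."""
    return sorted(systems, key=lambda s: completion_time_h(stages, s))


if __name__ == "__main__":
    workflow = [Stage("pre", 8.0, 4, 10.0), Stage("sim", 512.0, 64, 100.0)]
    candidates = [
        System("A", queue_wait_h=2.0, speed=1.0, bandwidth_gb_per_h=100.0),
        System("B", queue_wait_h=0.5, speed=0.5, bandwidth_gb_per_h=50.0),
    ]
    for system in rank_systems(workflow, candidates):
        print(system.name, round(completion_time_h(workflow, system), 2))
```

In this sketch the queue wait is a single fixed number per system. The volatility discussed above is exactly what such a fixed value cannot capture, which is why advance notice of down-time or reservations could change the ranking.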
