PANORAMA: An approach to performance modeling and diagnosis of extreme-scale workflows

Computational science is well established as the third pillar of scientific discovery, on par with experimentation and theory. However, as we move closer to executing exascale calculations and processing the extreme-scale volumes of data produced by both experiments and computations, the complexity of managing the compute and data analysis tasks has grown beyond the capabilities of domain scientists. Workflow management systems are therefore necessary to ensure current and future scientific discoveries. A key research question for these systems concerns the performance optimization of complex calculation and data analysis tasks. The central contribution of this article is a description of the PANORAMA approach for modeling and diagnosing the run-time performance of complex scientific workflows. This approach integrates extreme-scale systems testbed experimentation, structured analytical modeling, and parallel systems simulation into a comprehensive workflow framework called Pegasus for understanding and improving the overall performance of complex scientific workflows.
