Modular performance prediction for scientific workflows using Machine Learning

Abstract: Scientific workflows provide an intuitive and efficient way to design computational experiments declaratively. A distributed workflow is typically executed on a variety of resources and uses a variety of computational algorithms or tools to achieve the desired outcomes. This variety adds complexity to scheduling such workflows on large-scale computers. As computation becomes more distributed, insight into the expected workload that a workflow presents becomes critical for effective resource allocation. In this paper, we present a modular framework that leverages Machine Learning to create precise performance predictions for a workflow. The central idea is to partition a workflow so that forecasting each atomic unit is manageable and the individual predictions can be combined efficiently. We treat the combination of an executable and a specific physical resource as a single module, which lets us characterize workload and machine power as a single unit of prediction. Because the framework builds atomic modules and uses a longest-path approach to estimate workflow performance, it makes no assumptions about the underlying workflow structure; it can therefore adapt to highly complex nested directed acyclic workflows and scale to new scenarios. We present performance estimation results for independent workflow modules executed on the XSEDE SDSC Comet cluster using various Machine Learning algorithms. The results provide insights into the behavior and effectiveness of different algorithms in the context of scientific workflow performance prediction.
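The longest-path combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each module (an executable paired with a resource) already has a predicted runtime from some trained model, and that the workflow is given as a non-empty directed acyclic graph of dependencies. The function name `estimate_makespan`, the module names, and the runtime values are hypothetical placeholders chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def estimate_makespan(predicted_runtime, depends_on):
    """Estimate total workflow time as the longest path through the DAG.

    predicted_runtime: module name -> predicted runtime in seconds
                       (assumed to come from a per-module ML model)
    depends_on:        module name -> modules that must finish first
    Returns (estimated total time, modules on the critical path).
    """
    graph = {m: depends_on.get(m, ()) for m in predicted_runtime}
    finish = {}    # earliest finish time of each module
    previous = {}  # predecessor on the longest path into each module

    # Visit modules so that every dependency is processed first.
    for module in TopologicalSorter(graph).static_order():
        preds = graph[module]
        latest = max(preds, key=finish.__getitem__, default=None)
        start = finish[latest] if latest is not None else 0.0
        finish[module] = start + predicted_runtime[module]
        previous[module] = latest

    # Walk back from the module that finishes last.
    node = max(finish, key=finish.__getitem__)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = previous[node]
    return total, path[::-1]


# Hypothetical modules: "<executable>@<resource>" with made-up predictions.
runtimes = {
    "fetch@comet": 30.0,
    "align@comet": 120.0,
    "stats@comet": 45.0,
    "report@comet": 10.0,
}
deps = {
    "align@comet": ["fetch@comet"],
    "stats@comet": ["fetch@comet"],
    "report@comet": ["align@comet", "stats@comet"],
}

total, path = estimate_makespan(runtimes, deps)
print(total)  # 160.0
print(path)   # ['fetch@comet', 'align@comet', 'report@comet']
```

In this sketch the parallel branch through `stats@comet` does not affect the estimate, because only the slowest chain of dependent modules determines when the final module can finish. The traversal relies only on the graph being acyclic, which is why a longest-path estimate does not need to know the workflow's shape in advance.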
