Predicting the Execution Time of Workflow Activities Based on Their Input Features

The ability to accurately estimate the execution time of computationally expensive e-science algorithms enables better scheduling of the workflows that incorporate those algorithms as building blocks, and gives users insight into the expected cost of executing a workflow on cloud resources. When a large history of past runs is available, crude estimates such as the average execution time are easy to produce. We hypothesize that, for some algorithms, better estimates can be obtained by using those histories to learn regression models that predict execution time from selected features of the inputs. We refer to this property as the input predictability of an algorithm. Our motivation comes from e-science workflows that involve the repeated training of multiple learning models. We therefore test the hypothesis on the specific case of the C4.5 decision tree builder, a well-known learning method whose training time is indeed sensitive to the input dataset, but in non-obvious ways. We use this case study to demonstrate a method for assessing input predictability. While the results are promising, we also find that applying the method more generally involves a trade-off between treating the algorithms under analysis as black boxes and the need for expert insight into the relevant features of their inputs.
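
A minimal sketch of the idea described above, assuming Python with scikit-learn; the dataset features, the synthetic run history, and the choice of random-forest regression are all illustrative assumptions, not the paper's own experimental setup. It learns a regression model mapping input-dataset features to observed execution time and compares its cross-validated error against the crude baseline of always predicting the mean execution time.

```python
# Sketch only: hypothetical run history and feature names, not the authors' data.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_runs = 200

# Hypothetical features of each past training run's input dataset:
# number of instances, number of attributes, (normalised) class entropy.
X = np.column_stack([
    rng.integers(1_000, 100_000, n_runs),
    rng.integers(5, 200, n_runs),
    rng.uniform(0.1, 1.0, n_runs),
])
# Synthetic observed execution times (seconds), for illustration only.
y = 1e-4 * X[:, 0] * np.log(X[:, 1]) * (0.5 + X[:, 2]) + rng.normal(0, 5, n_runs)

# Baseline: always predict the mean of the observed execution times.
baseline = DummyRegressor(strategy="mean")
baseline_mae = -cross_val_score(baseline, X, y, cv=5,
                                scoring="neg_mean_absolute_error").mean()

# Feature-based regression model trained on the run history.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model_mae = -cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_absolute_error").mean()

print(f"baseline MAE: {baseline_mae:.1f}s, feature-based MAE: {model_mae:.1f}s")
# If the feature-based error is substantially lower than the baseline error,
# the algorithm can be regarded as input-predictable in the sense used above.
```

In this sketch, input predictability is judged by whether the feature-based model beats the mean-only baseline under cross-validation; the choice of regression learner and of which input features to extract is exactly where the trade-off between black-box treatment and expert insight arises.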
