Predictive performance modeling for distributed batch processing using black box monitoring and machine learning

Abstract In many domains, the previous decade was characterized by increasing data volumes and growing complexity of data analyses, creating new demands for batch processing on distributed systems. Operating these systems effectively, e.g., for scheduling and resource allocation, is challenging when the performance of jobs and tasks under varying resource configurations is uncertain. We survey predictive performance modeling (PPM) approaches that estimate performance metrics of future jobs and tasks, such as execution duration, required memory, or wait times, based on past performance observations. We focus on non-intrusive methods, i.e., methods that can be applied to any workload without modification, since the workload is usually a black box from the perspective of the systems managing the computational infrastructure. We classify and compare sources of performance variation, predicted performance metrics, limitations and challenges, required training data, use cases, and the underlying prediction techniques. We conclude by identifying several open problems and pressing research needs in the field.
