Exploring the Role of Machine Learning in Scientific Workflows: Opportunities and Challenges

In this survey, we discuss the challenges of executing scientific workflows and the existing Machine Learning (ML) techniques that alleviate those challenges. We provide the context and motivation for applying ML to each step of workflow execution, and we recommend ways to extend ML techniques to challenges that remain unresolved. We also discuss the possibility of using ML techniques for in-situ operations: we explore the challenges of in-situ workflows and suggest how ML techniques could improve the performance of their execution.
