Scalable Workflow-Driven Hydrologic Analysis in HydroFrame

The HydroFrame project is a community platform designed to facilitate integrated hydrologic modeling across the United States. As part of HydroFrame, we seek to design workflow solutions that enable hydrologic analysis for three target user groups: the modeler, the analyzer, and the domain science educator. We present initial progress on the HydroFrame community platform using an automated Kepler workflow. This workflow performs end-to-end hydrology simulations involving data ingestion, preprocessing, analysis, modeling, and visualization. We demonstrate how different modules of the workflow can be reused and repurposed for the three target user groups. The Kepler workflow ensures complete reproducibility through a built-in provenance framework that collects workflow-specific parameters, software versions, and hardware system configuration. In addition, we aim to optimize the utilization of large-scale computational resources to meet the needs of all three user groups. Toward this goal, we present a design that uses an automatic performance-collection component of the pipeline, together with provenance data and machine learning techniques, to predict performance and forecast failures.
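
To make the provenance idea concrete, the following is a minimal Python sketch of a record holding the three kinds of information named above: workflow-specific parameters, software versions, and hardware system configuration. It is not the Kepler provenance framework or its API. The function name `collect_provenance`, the record layout, and the example parameters are hypothetical and chosen only for illustration.

```python
import json
import os
import platform
from datetime import datetime, timezone
from importlib import metadata


def collect_provenance(params, packages=()):
    """Return a JSON-serializable provenance record for one workflow run.

    Hypothetical helper: the field names are illustrative assumptions,
    not the schema used by Kepler or HydroFrame.
    """
    # Software versions: the interpreter plus any requested packages.
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here

    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        # Workflow-specific parameters, copied so later edits do not leak in.
        "parameters": dict(params),
        "software_versions": versions,
        # Basic hardware and operating system description.
        "hardware": {
            "machine": platform.machine(),
            "system": platform.system(),
            "release": platform.release(),
            "cpu_count": os.cpu_count(),
        },
    }


if __name__ == "__main__":
    # Example parameters are made up for demonstration.
    record = collect_provenance(
        {"domain": "example_watershed", "timesteps": 24},
        packages=["numpy"],
    )
    print(json.dumps(record, indent=2))
```

Storing one such record alongside each run's outputs is one simple way to let a later run be compared against, or repeated under, the same recorded conditions.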
