Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications

Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. We introduce the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability they pose. We describe the design and implementation of EnTK, characterize its performance and integrate it with two exemplar use cases: seismic inversion and adaptive analog ensembles. We perform nine experiments, characterizing EnTK overheads, strong and weak scalability, and the performance of the two use-case implementations, at scale and on production infrastructures. We show how EnTK meets the following general requirements: (i) dedicated abstractions to support the description and execution of ensemble applications; (ii) support for execution on heterogeneous computing infrastructures; (iii) efficient scalability up to O(10^4) tasks; and (iv) task-level fault tolerance. We discuss the novel computational capabilities that EnTK enables and the scientific advantages that arise from them. We propose EnTK as an important addition to the suite of tools in support of production scientific computing.
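To make requirements (i) and (iv) concrete, the following is a minimal sketch of an ensemble of independent tasks executed concurrently with task-level retries. It uses only the Python standard library and is not EnTK's API; the names `TaskResult`, `make_flaky`, `run_with_retry` and `run_ensemble`, as well as the retry policy, are assumptions introduced purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class TaskResult:
    """Outcome of one ensemble member (hypothetical type, not part of EnTK)."""
    name: str
    ok: bool
    attempts: int
    value: Any = None


def make_flaky(failures: int, value: Any) -> Callable[[], Any]:
    """Build a task that fails `failures` times and then returns `value`."""
    remaining = [failures]

    def task() -> Any:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("simulated transient failure")
        return value

    return task


def run_with_retry(name: str, fn: Callable[[], Any], max_attempts: int) -> TaskResult:
    """Run a single task, resubmitting it on failure up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return TaskResult(name, True, attempt, fn())
        except Exception:
            continue  # only this task is retried; other tasks are unaffected
    return TaskResult(name, False, max_attempts)


def run_ensemble(tasks: Dict[str, Callable[[], Any]],
                 max_attempts: int = 3,
                 workers: int = 4) -> List[TaskResult]:
    """Execute all tasks concurrently and collect one result per task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_with_retry, name, fn, max_attempts)
                   for name, fn in tasks.items()]
        return [future.result() for future in futures]


if __name__ == "__main__":
    ensemble = {
        "stable": lambda: 1,
        "flaky": make_flaky(1, 2),     # fails once, then succeeds
        "broken": make_flaky(5, 3),    # exhausts its retry budget
    }
    for result in run_ensemble(ensemble):
        print(result)
```

In this sketch a failing task is retried in isolation, so one faulty ensemble member neither aborts nor delays the resubmission of the others. This is the general sense in which fault tolerance is "task-level"; a production system would additionally have to handle resource acquisition, scheduling and data staging.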
