TASKWORK: A Cloud-aware Runtime System for Elastic Task-parallel HPC Applications

With the ability to provision virtually unlimited compute resources, the cloud has evolved into an attractive execution environment for applications from the High Performance Computing (HPC) domain. By means of elastic scaling, compute resources can be provisioned and decommissioned at runtime. This gives rise to a new concept in HPC: elasticity of parallel computations. However, it is still an open research question to what extent HPC applications can benefit from elastic scaling and how to leverage elasticity of parallel computations. In this paper, we discuss how to address these challenges for HPC applications with dynamic task parallelism and present TASKWORK, a cloud-aware runtime system based on our findings. TASKWORK enables the implementation of elastic HPC applications by means of higher-level development frameworks and solves the corresponding coordination problems using Apache ZooKeeper. For evaluation purposes, we discuss a development framework for parallel branch-and-bound based on TASKWORK, show how to implement an elastic HPC application, and report on measurements of parallel efficiency and elastic scaling.
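
For readers unfamiliar with the metric, parallel efficiency is commonly defined as the speedup divided by the number of processing units. The following is a minimal sketch of that textbook definition, not the measurement procedure used for TASKWORK; the function names and the sample timings are hypothetical.

```python
def speedup(t_seq: float, t_par: float) -> float:
    """Textbook speedup: sequential runtime divided by parallel runtime."""
    if t_seq <= 0 or t_par <= 0:
        raise ValueError("runtimes must be positive")
    return t_seq / t_par


def parallel_efficiency(t_seq: float, t_par: float, workers: int) -> float:
    """Textbook parallel efficiency: speedup divided by the number of workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup(t_seq, t_par) / workers


# Hypothetical timings (seconds), chosen only to illustrate the formula.
example = parallel_efficiency(t_seq=100.0, t_par=25.0, workers=8)
```

In an elastic setting the number of workers changes during a run, so a single `workers` value is a simplification; the sketch assumes a fixed worker count per measurement.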
