The DEEP-ER Project: I/O and Resiliency Extensions for the Cluster-Booster Architecture

The recently completed research project DEEP-ER has developed a variety of hardware and software technologies to improve the I/O capabilities of next-generation high-performance computers and to enable applications to recover from the higher hardware failure rates expected on these machines. The heterogeneous Cluster-Booster architecture, first introduced in the predecessor DEEP project, has been extended with a multi-level memory hierarchy employing non-volatile and network-attached memory devices. Based on this hardware infrastructure, an I/O and resiliency software stack has been implemented that combines and extends well-established libraries and software tools while adhering to standard user interfaces. Real-world scientific codes have tested the project's developments and demonstrated the improvements achieved without compromising the portability of the applications.
