Scaling OpenMP Programs to Thousand Cores on the Numascale Architecture

A drawback of shared-memory programming compared to message passing is that a program is limited to a single system, whereas message passing allows applications to run on a cluster of shared-memory nodes. Numascale’s interconnect couples several machines in a cache-coherent way to form a single system spanning multiple boards, which allows shared-memory programming across the complete machine. However, this does not necessarily mean that shared-memory programs deliver satisfactory performance on such a system. In this work we investigate a Numascale machine with 1728 cores hosted at the University of Oslo’s Center for Information Technology. The system consists of 72 IBM x3755 M3 nodes coupled in a 3D torus network topology with Numascale’s interconnect. We measure the memory bandwidth with kernel benchmarks and furthermore examine TrajSearch, an application developed at the Institute of Combustion Technology at RWTH Aachen University. We describe all tuning steps taken so far to optimize the application for large SMP machines such as the Numascale machine and present good performance results for OpenMP runs with 1024 threads on the Oslo system. The structure of this work is as follows: first, we describe the Numascale technology and the test system in Sec. II. Second, we present performance results on the system and describe our tuning steps for the TrajSearch code in Sec. III. Finally, we conclude and discuss the future steps needed in Sec. IV.
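
To make the notion of a memory bandwidth kernel benchmark concrete, the following is a minimal sketch of a STREAM-style triad kernel in C with OpenMP. It is an illustration only and not necessarily the benchmark used in this work; the function names (wall_seconds, init_arrays, triad, triad_bandwidth_gbs) and the array contents are hypothetical. The sketch assumes an operating system with a first-touch page placement policy (the Linux default), so the arrays are initialized in parallel with the same static schedule that the kernel uses later, which tends to keep each thread's data in memory close to it.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds: OpenMP timer if available, else C11 timespec_get. */
double wall_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* Parallel initialization. Under a first-touch policy (an assumption here),
 * each page is placed near the thread that writes it first. */
void init_arrays(double *a, double *b, double *c, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }
}

/* Triad kernel: a[i] = b[i] + s * c[i]. Same static schedule as init_arrays. */
void triad(double *a, const double *b, const double *c, double s, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        a[i] = b[i] + s * c[i];
    }
}

/* Returns the best observed triad bandwidth in GB/s over `reps` runs,
 * or -1.0 on allocation failure, wrong results, or a too-coarse timer. */
double triad_bandwidth_gbs(long n, int reps) {
    assert(n > 0 && reps > 0);
    double *a = malloc((size_t)n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    double *c = malloc((size_t)n * sizeof *c);
    double result = -1.0;

    if (a != NULL && b != NULL && c != NULL) {
        init_arrays(a, b, c, n);

        double best = -1.0;  /* shortest run time seen so far */
        for (int r = 0; r < reps; ++r) {
            double t0 = wall_seconds();
            triad(a, b, c, 3.0, n);
            double dt = wall_seconds() - t0;
            if (best < 0.0 || dt < best) best = dt;
        }

        /* Every element must equal 1.0 + 3.0 * 2.0 = 7.0. */
        int ok = 1;
        for (long i = 0; i < n; ++i) {
            if (a[i] != 7.0) { ok = 0; break; }
        }

        /* Traffic per run: two arrays read and one written. */
        double bytes = 3.0 * (double)sizeof(double) * (double)n;
        if (ok && best > 0.0) result = bytes / best / 1e9;
    }

    free(a);
    free(b);
    free(c);
    return result;
}
```

Calling triad_bandwidth_gbs with arrays much larger than the combined cache size and with a varying number of threads (for example via OMP_NUM_THREADS) gives a rough picture of how bandwidth scales. Taking the best of several repetitions reduces the influence of one-off disturbances such as page faults in the first run.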