OpenMP scalability limits on large SMPs and how to extend them

The most widely used node type in high-performance computing today is the 2-socket server. Thousands of such nodes are coupled into clusters via a fast interconnect, e.g. InfiniBand. The Message Passing Interface (MPI) has become the de-facto standard for programming these clusters. However, MPI requires a very explicit expression of data layout and data transfer in a parallel program, so parallelizing an application often means rewriting it. An alternative to MPI is OpenMP, which allows a serial application to be parallelized incrementally by adding pragmas to compute-intensive regions of the code. This is often more feasible than rewriting the application with MPI. The disadvantage of OpenMP is that it requires shared memory and therefore cannot be used across the nodes of a cluster.

Several hardware vendors offer large machines with memory shared among all cores of the system. Maintaining coherency between memory and all cores is a challenging task, so these machines exhibit characteristics different from those of standard 2-socket servers. A programmer must take these characteristics into account to achieve good performance on such a system. In this work, I investigate several large shared-memory machines to highlight these characteristics, and I show how they can be handled in OpenMP programs. Where OpenMP cannot handle a given problem, I present user-space solutions that could be added to OpenMP for better support of large systems.

Furthermore, I present a tools-guided workflow to optimize applications for such machines. I investigate the ability of performance tools to highlight performance issues, and I present improvements that enable these tools to handle OpenMP tasks. These improvements make it possible to analyze the efficiency of task-parallel execution, especially on large shared-memory machines. The workflow also includes a performance model to assess how well an application performs on a given system and to decide when to stop tuning. Finally, I present two application case studies in which user codes were optimized with the techniques presented in this thesis and reached good performance.
