OpenMP scalability limits on large SMPs and how to extend them

The most widely used node type in high-performance computing today is the 2-socket server. Thousands of such nodes are coupled into clusters via a fast interconnect, e.g. InfiniBand. The Message Passing Interface (MPI) has become the de-facto standard for programming these clusters. However, MPI requires a very explicit expression of data layout and data transfer in a parallel program, so parallelizing an application often means rewriting it. An alternative to MPI is OpenMP, which allows a serial application to be parallelized incrementally by adding pragmas to compute-intensive regions of the code. This is often more feasible than rewriting the application with MPI. The disadvantage of OpenMP is that it requires shared memory and therefore cannot be used across the nodes of a cluster.

Several hardware vendors offer large machines with memory shared among all cores of the system. Maintaining coherency between memory and all cores is a challenging task, so these machines exhibit characteristics different from those of standard 2-socket servers. A programmer must take these characteristics into account to achieve good performance on such a system. In this work, I investigate several large shared-memory machines to highlight these characteristics, and I show how they can be handled in OpenMP programs. Where OpenMP cannot handle a given problem, I present user-space solutions that could be added to OpenMP for better support of large systems.

Furthermore, I present a tools-guided workflow to optimize applications for such machines. I investigate the ability of performance tools to highlight performance issues, and I present improvements that enable these tools to handle OpenMP tasks. These improvements make it possible to analyze the efficiency of task-parallel execution, especially on large shared-memory machines. The workflow also includes a performance model to assess how well an application performs on a given system and to decide when to stop tuning. Finally, I present two application case studies in which user codes were optimized with the techniques presented in this thesis and reached good performance.
