Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications

High-performance computing requires deep knowledge of the hardware platform to fully exploit its computing power. The performance of data transfers between cores and memory is becoming critical, making locality a major area of optimization on the road to exascale: tasks and data have to be carefully distributed across computing and memory resources. We discuss the current ways to expose processor and memory locality information in the Linux kernel and in user-space libraries such as the hwloc software project. The current de facto standard, a structural model of the platform as a tree, is not perfect, but it offers a good compromise between precision and convenience for HPC runtimes. We present an in-depth study of the software view of the upcoming Intel Knights Landing processor and show that its memory locality cannot be properly exposed to user-space applications without a significant rework of the current software stack. We then propose an extension of the hierarchical platform model in hwloc that correctly exposes new heterogeneous architectures with high-bandwidth or non-volatile memories to applications, while remaining convenient for affinity-aware HPC runtimes.
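To make the tree model concrete, here is a minimal sketch against the hwloc 2.x C API (error handling omitted; note that on hwloc 1.x the local memory size lives in `obj->memory.local_memory` instead of `obj->attr->numanode.local_memory`). It loads the topology and walks the NUMA nodes, printing each node's local memory size and the cpuset of the processors local to it, which is the kind of locality information an affinity-aware runtime consumes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    int i, n;

    /* Build hwloc's structural model of the machine: a tree of
     * Machine, Package, Cache, Core, PU, and NUMANODE objects. */
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Enumerate NUMA nodes; each one reports its local memory size
     * and the set of processors (cpuset) that are local to it. */
    n = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE);
    for (i = 0; i < n; i++) {
        hwloc_obj_t node = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NUMANODE, i);
        char *cpuset_str;
        hwloc_bitmap_asprintf(&cpuset_str, node->cpuset);
        printf("NUMA node #%u: %llu bytes local, near cpuset %s\n",
               node->os_index,
               (unsigned long long) node->attr->numanode.local_memory,
               cpuset_str);
        free(cpuset_str);
    }

    hwloc_topology_destroy(topology);
    return 0;
}
```

Built against an installed hwloc, this compiles with `cc locality.c $(pkg-config --cflags --libs hwloc)`. On a heterogeneous-memory machine such as Knights Landing, several NUMA nodes may share the same cpuset (e.g. MCDRAM and DDR attached to the same cores), which is precisely the case the flat NUMA-node list struggles to express and the extended hierarchical model is meant to capture.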
