Exploring Thread and Memory Placement on NUMA Architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport

Modern shared-memory multiprocessor systems commonly have non-uniform memory access (NUMA), with asymmetric memory bandwidth and latency characteristics. Operating systems now provide application programming interfaces that allow the user to control thread and memory placement explicitly. To date, however, there have been relatively few detailed assessments of the importance of memory and thread placement for complex applications. This paper outlines a framework for performing memory and thread placement experiments on Solaris and Linux. Thread binding, location-specific memory allocation, and the verification of that allocation are discussed and contrasted for the two operating systems. Using the framework, the performance characteristics of serial versions of lmbench, Stream, and various BLAS libraries (ATLAS, GOTO, and ACML on Opteron/Linux; Sunperf on Opteron and UltraSPARC/Solaris) are measured on two different hardware platforms (UltraSPARC/FirePlane and Opteron/HyperTransport). A simple model describing performance as a function of memory distribution is proposed and assessed for both the Opteron and the UltraSPARC.
