Interactive locality optimization on NUMA architectures

Optimizing the performance of shared-memory NUMA programs remains something of a black art, requiring application writers to understand their programs' behavior in depth. This difficulty is one of the remaining hindrances to the widespread adoption and deployment of cost-efficient, scalable shared-memory NUMA architectures. To address this problem, we have developed a performance monitoring infrastructure and a corresponding set of tools that help visualize and understand the subtleties of the memory access behavior of parallel NUMA applications with large datasets. The tools are designed to be general, interoperable, and easily portable. We give detailed examples of the use of one particular tool in the set, a memory access visualization tool. Applied to a range of applications, this tool improved performance by around 90% on average.
