Hardware monitors for dynamic page migration

In this paper, we first introduce a profile-driven online page migration scheme and investigate its impact on the performance of multithreaded applications. We use centralized lightweight, inexpensive plug-in hardware monitors to profile the memory access behavior of an application, and then migrate pages to memory local to the most frequently accessing processor. We also investigate the use of several other potential sources of data gathered from hardware monitors and compare their effectiveness to using data from centralized hardware monitors. In particular, we investigate the effectiveness of using cache miss profiles, Translation Lookaside Buffer (TLB) miss profiles and the content of the on-chip TLBs using the valid bit information. Moreover, we also introduce a modest hardware feature, called Address Translation Counters (ATC), and compare its effectiveness with other sources of hardware profiles. Using the Dyninst runtime instrumentation combined with hardware monitors, we were able to add page migration capabilities to a Sun Fire 6800 server without having to modify the operating system kernel, or to re-compile application programs. Our dynamic page migration scheme reduced the total number of non-local memory accesses of applications by up to 90% and improved the execution times up to 16%. We also conducted a simulation based study and demonstrated that cache miss profiles gathered from on-chip CPU monitors, which are typically available in current microprocessors, can be effectively used to guide dynamic page migrations in applications.

[1]  Robert J. Fowler,et al.  NUMA policies and their relation to memory architecture , 1991, ASPLOS IV.

[2]  Frank Mueller,et al.  Hardware profile-guided automatic page placement for ccNUMA systems , 2006, PPoPP '06.

[3]  S. Turner,et al.  Performance Analysis Using the MIPS R10000 Performance Counters , 1996, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.

[4]  A. Charlesworth The Sun Fireplane System Interconnect , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[5]  Lisa Noordergraaf,et al.  Performance experiences on Sun's Wildfire prototype , 1999, SC '99.

[6]  Carla Schlatter Ellis,et al.  The robustness of NUMA memory management , 1991, SOSP '91.

[7]  Jeffrey K. Hollingsworth,et al.  Using Hardware Performance Monitors to Isolate Memory Bottlenecks , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[8]  Robert Zak,et al.  SMP System Interconnect Instrumentation for Performance Analysis , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[9]  Eduard Ayguadé,et al.  User-level dynamic page migration for multiprogrammed shared-memory multiprocessors , 2000, Proceedings 2000 International Conference on Parallel Processing.

[10]  Michael Woodacre The SGI® Altix 3000 Global Shared-Memory Architecture , 2003 .

[11]  Chris Johnson,et al.  Data Distribution , Migration and Replication on a cc-NUMA Architecture , 2002 .

[12]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[13]  Jeffrey K. Hollingsworth,et al.  An API for Runtime Code Patching , 2000, Int. J. High Perform. Comput. Appl..

[14]  Erik Hagersten,et al.  WildFire: a scalable path for SMPs , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[15]  K HollingsworthJeffrey,et al.  An API for Runtime Code Patching , 2000 .

[16]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[17]  Jeffrey K. Hollingsworth,et al.  Using Hardware Counters to Automatically Improve Memory Performance , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[18]  Mark S. Squillante,et al.  Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling , 1993, IEEE Trans. Parallel Distributed Syst..

[19]  Anoop Gupta,et al.  Scheduling and page migration for multiprocessor compute servers , 1994, ASPLOS VI.