Feedback-Driven Restructuring of Multi-threaded Applications for NUCA Cache Performance in CMPs

This paper addresses feedback-directed restructuring techniques tuned to Non-Uniform Cache Architectures (NUCA) in CMPs running multi-threaded applications. Access time to a NUCA cache depends on the location of the referenced block, so the locality and cache mapping of the application influence overall performance. We show techniques for altering how applications are distributed across the cache space so as to improve average memory access time. In CMPs running multi-threaded applications, the aggregated accesses (and locality) of all processors form the actual cache load and pose specific issues. We consider a number of Splash-2 and Parsec benchmarks on an 8-processor system and show that a relatively simple remapping algorithm improves the average Static-NUCA (SNUCA) cache access time by 5.5% and allows an SNUCA cache to surpass the performance of a more complex Dynamic-NUCA (DNUCA) cache for most benchmarks. We then present a more sophisticated remapping algorithm, relying on cache geometry information and on access distribution statistics from individual processors, that reduces the average cache access time by 10.2% and is very stable across all benchmarks.
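To make the idea of profile-driven remapping concrete, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes a hypothetical model in which the address space is split into "regions", each region is mapped to exactly one cache bank, `latency[p][b]` is the access time from processor `p` to bank `b` (the cache geometry), and `accesses[p][r]` is a profiled access count from processor `p` to region `r`. All names and numbers are illustrative assumptions.

```python
def average_access_time(accesses, latency, mapping):
    """Access-weighted mean latency; mapping[r] is the bank holding region r."""
    total = sum(sum(row) for row in accesses)
    weighted = sum(
        count * latency[p][mapping[r]]
        for p, row in enumerate(accesses)
        for r, count in enumerate(row)
    )
    return weighted / total


def greedy_remap(accesses, latency):
    """Place regions, hottest first, in the free bank with the lowest
    aggregated cost over all processors (one region per bank)."""
    n_procs = len(accesses)
    n_regions = len(accesses[0])
    free_banks = set(range(len(latency[0])))
    # Aggregated load of each region across all processors.
    load = [sum(accesses[p][r] for p in range(n_procs)) for r in range(n_regions)]
    mapping = [None] * n_regions
    for r in sorted(range(n_regions), key=lambda region: -load[region]):
        def cost(bank, region=r):
            return sum(accesses[p][region] * latency[p][bank] for p in range(n_procs))
        best = min(sorted(free_banks), key=cost)
        mapping[r] = best
        free_banks.remove(best)
    return mapping


# Hypothetical profile: 2 processors, 3 regions, 3 banks.
latency = [[2, 4, 6],   # processor 0 is closest to bank 0
           [6, 4, 2]]   # processor 1 is closest to bank 2
accesses = [[10, 80, 5],
            [70, 10, 5]]

baseline = average_access_time(accesses, latency, [0, 1, 2])
mapping = greedy_remap(accesses, latency)
remapped = average_access_time(accesses, latency, mapping)
print(mapping, round(baseline, 3), round(remapped, 3))
```

In this toy profile the region used mostly by processor 0 moves next to processor 0 and the region used mostly by processor 1 moves next to processor 1, which lowers the weighted average latency. A real remapping would also have to respect bank capacity and conflict behavior, which this sketch ignores.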
