Compiler-based data classification for hybrid caching

As chip multiprocessor systems incorporate an increasing number of cores, memory access latency, which is affected by on-chip communication and remote data cache access, is becoming a critical bottleneck. To combat this problem, advanced cache organizations have been proposed as alternatives to traditional private and static non-uniform cache access (e.g., distributed shared) architectures. In this paper, we demonstrate how fairly simple compiler analysis can classify memory accesses as private data accesses or shared data accesses. In addition, we introduce a third classification, probably private access, and demonstrate its impact relative to the traditional private and shared categories. The classification information from the compiler analysis is then provided to the runtime system through the page table to facilitate a hybrid private-shared caching technique. The proposed cache mechanism distinguishes data access patterns and adopts different placement and search policies accordingly to improve performance. Our analysis demonstrates that many applications have a significant amount of both private and shared data and that compiler analysis can identify the private data effectively for many applications.
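
To make the idea of class-dependent placement and search concrete, the following is a minimal illustrative sketch and not the mechanism evaluated in the paper. It assumes a tiled design with one cache slice per core. The names `AccessClass`, `home_slice`, and `needs_remote_search`, and the specific policies (local placement for private and probably private pages, address-interleaved placement for shared pages, a fallback search only for probably private pages), are hypothetical choices made for illustration.

```python
from enum import Enum


class AccessClass(Enum):
    """Per-page classification, assumed to be stored in the page table entry."""
    PRIVATE = "private"
    PROBABLY_PRIVATE = "probably_private"
    SHARED = "shared"


def home_slice(page_class, requesting_core, page_number, num_cores):
    """Choose the cache slice in which a block from this page is placed.

    Assumed policy: private and probably private data stay in the
    requester's local slice; shared data is interleaved across slices
    by page number.
    """
    if page_class in (AccessClass.PRIVATE, AccessClass.PROBABLY_PRIVATE):
        return requesting_core
    return page_number % num_cores


def needs_remote_search(page_class):
    """Decide whether a local miss must be followed by a wider search.

    Assumed policy: only probably private data can turn out to live in
    another core's slice, so only that class pays for the extra search.
    """
    return page_class is AccessClass.PROBABLY_PRIVATE
```

Under these assumptions, a private page requested by core 3 is placed in slice 3, while shared page 10 in an 8-core system maps to slice 2 regardless of the requester.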
