Thread Owned Block Cache: Managing Latency in Many-Core Architecture

Shared last level cache is crucial to performance. However, multithread program model incurs serious contention in shared cache. In this paper, to reduce average cache access latency, we propose two schemes. First, an implicitly dynamic cache partitioning scheme, i.e. block agglutinating. The purpose is to isolate conflicting data blocks. Second, a novel hardware buffer, called thread owned block cache, i.e. TOB Cache. The purpose is to store conflicting data blocks. Extensive analysis of the proposed schemes with Splash2 benchmarks and Bioinformatics workloads is performed using a cycle accurate many-core simulator. Experimental results show that the proposed schemes make conflict miss rate of shared cache reduced by 40% compared to traditional shared cache. Compared with victim cache, average load latency of shared cache and primary data cache is reduced by about 26% and 12%, respectively; primary data cache miss penalties are reduced by about 14%, and IPC is improved by 17%.

[1]  G. Edward Suh,et al.  Dynamic Partitioning of Shared Cache Memory , 2004, The Journal of Supercomputing.

[2]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[3]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[4]  Gu-Yeon Wei,et al.  Process Variation Tolerant 3T1D-Based Cache Architectures , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[5]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[6]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[7]  Moinuddin K. Qureshi Adaptive Spill-Receive for robust high-performance caching in CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[8]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[9]  Mary Jane Irwin,et al.  A novel migration-based NUCA design for chip multiprocessors , 2008, HiPC 2008.

[10]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[11]  Lei Liu,et al.  Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions , 2009, Journal of Computer Science and Technology.

[12]  Gregory F. Pfister,et al.  “Hot spot” contention and combining in multistage interconnection networks , 1985, IEEE Transactions on Computers.

[13]  Henry Hoffmann,et al.  Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[14]  José E. Moreira,et al.  Dissecting Cyclops: a detailed analysis of a multithreaded architecture , 2003, CARN.

[15]  José González,et al.  The design and performance of a conflict-avoiding cache , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[16]  Michael Zhang,et al.  Victim Migration: Dynamically Adapting Between Private and Shared CMP Caches , 2005 .

[17]  Jaehyuk Huh,et al.  A NUCA substrate for flexible CMP cache sharing , 2005, ICS.

[18]  Jichuan Chang,et al.  Cooperative cache partitioning for chip multiprocessors , 2007, ICS '07.

[19]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[20]  Wen Gao,et al.  Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry , 2004, Bioinform..

[21]  S. Kim,et al.  Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[22]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[23]  Dean M. Tullsen,et al.  Runtime identification of cache conflict misses: The adaptive miss buffer , 2001, TOCS.

[24]  Babak Falsafi,et al.  R-NUCA: Data Placement in Distributed Shared Caches , 2009 .

[25]  Mahmut T. Kandemir,et al.  Adaptive set pinning: managing shared caches in chip multiprocessors , 2008, ASPLOS.

[26]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.

[27]  Chuanjun Zhang Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[28]  Song Feng An Implicitly Dynamic Shared Cache Isolation in Many-Core Architecture , 2009 .