Temporal-Aware Mechanism to Detect Private Data in Chip Multiprocessors

Most of the data referenced by sequential and parallel applications running in current chip multiprocessors are referenced by only one thread and can be considered as private data. A lot of recent proposals leverage this observation to improve many aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depend to a large extent on the amount of detected private data. However, the mechanisms proposed so far do not consider thread migration and the private use of data within different application phases. As a result, a considerable amount of data is not detected as private. In order to make this detection more accurate and reaching more significant improvements, we propose a mechanism that is able to account for both thread migration and private data within application phases. Simulation results for 16-core systems show that, thanks to our mechanism, the average number of pages detected as private significantly increases from 43% in previous proposals up to 74% in ours. Finally, when our detection mechanism is used to deactivate the coherence for private data in a directory protocol, our proposal improves execution time by 13% with respect to previous proposals.

[1]  Stefanos Kaxiras,et al.  Complexity-effective multicore coherence , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[2]  Mikko H. Lipasti,et al.  Coarse-Grain Coherence Tracking: RegionScout and Region Coherence Arrays , 2006, IEEE Micro.

[3]  Mikko H. Lipasti,et al.  Improving multiprocessor performance with coarse-grain coherence tracking , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[4]  Andreas Moshovos RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[5]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[6]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[7]  Jaehyuk Huh,et al.  Subspace snooping: Filtering snoops with operating system support , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[8]  Margaret Martonosi,et al.  Inter-core cooperative TLB for chip multiprocessors , 2010, ASPLOS XV.

[9]  Brad Calder,et al.  Basic block distribution analysis to find periodic behavior and simulation points in applications , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[10]  Pat Conway,et al.  Blade computing with the AMD Opteron™ processor ("magny-cours") , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).

[11]  Niraj K. Jha,et al.  In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[12]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[13]  Margaret Martonosi,et al.  Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[14]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[15]  Rami G. Melhem,et al.  Compiler-assisted data distribution for chip multiprocessors , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[16]  Seth H. Pugsley,et al.  SWEL: Hardware cache coherence protocols to map shared data onto shared caches , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[17]  Mohammad Alisafaee Spatiotemporal Coherence Tracking , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[18]  Niraj K. Jha,et al.  GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[19]  Antonio Robles,et al.  Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[20]  Babak Falsafi,et al.  Cuckoo directory: A scalable directory for many-core systems , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[21]  Shyamkumar Thoziyoor,et al.  CACTI 5 . 1 , 2008 .

[22]  Yen-Kuang Chen,et al.  The ALPBench benchmark suite for complex multimedia applications , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..

[23]  Alan L. Cox,et al.  Translation caching: skip, don't walk (the page table) , 2010, ISCA.

[24]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[25]  Satish Narayanasamy,et al.  End-to-end sequential consistency , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[26]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[27]  Milo M. K. Martin,et al.  Token Coherence: decoupling performance and correctness , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[28]  Kevin Skadron,et al.  Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling , 2009, 2009 IEEE International Conference on Computer Design.

[29]  Michael C. Huang,et al.  POPS: Coherence Protocol Optimization for Both Private and Shared Data , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[30]  Rami G. Melhem,et al.  Practically Private: Enabling high performance CMPs through compiler-assisted data classification , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[31]  Melanie Kambadur,et al.  Harmony: Collection and analysis of parallel block vectors , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[32]  Andreas Moshovos,et al.  A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[33]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[34]  Margaret Martonosi,et al.  Cache decay: exploiting generational behavior to reduce cache leakage power , 2001, ISCA 2001.

[35]  Mahmut T. Kandemir,et al.  Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[36]  Min Xu,et al.  Evaluating Non-deterministic Multi-threaded Commercial Workloads , 2001 .

[37]  Daniel J. Sorin,et al.  UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[38]  Antonio Robles,et al.  Increasing the Effectiveness of Directory Caches by Avoiding the Tracking of Noncoherent Memory Blocks , 2013, IEEE Transactions on Computers.

[39]  Mahmut T. Kandemir,et al.  Organizing the last line of defense before hitting the memory wall for CMPs , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[40]  Vijayalakshmi Srinivasan,et al.  A Tagless Coherence Directory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).