Thread Progress Aware Coherence Adaption for Hybrid Cache Coherence Protocols

In chip multiprocessor (CMP) systems, interference on shared resources such as on-chip caches typically leads to unbalanced progress among threads. Because of synchronization primitives such as barriers and locks, cores running fast threads must waste cycles waiting for cores running slow threads, which hurts both performance and energy efficiency. To improve performance and reduce energy consumption, this paper proposes adapting the cache coherence policy of each thread according to its level of delay tolerance. Specifically, it proposes Thread progrEss Aware Coherence Adaption (TEACA), which uses thread progress information as hints for coherence adaption. TEACA dynamically uses memory system statistics to estimate the progress of threads. Based on the estimated progress, TEACA categorizes threads into leader threads and laggard threads. These categorization decisions are then used for efficient coherence adaption on CMP systems that support hybrid coherence protocols. Experimental results show that, on a 64-core CMP system, TEACA outperforms a directory protocol in application execution time and a recently proposed hybrid protocol in both application execution time and energy dissipation.
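The categorization step can be illustrated with a minimal sketch. The abstract does not specify which memory system statistics TEACA uses or how the leader/laggard boundary is set, so everything below is an assumption made for illustration: per-thread memory stall cycles stand in for the progress estimate, the median is used as the cutoff, and the policy labels `low_latency` and `low_traffic` are hypothetical names rather than the protocols evaluated in the paper.

```python
from statistics import median


def classify_threads(stall_cycles):
    """Split threads into (leaders, laggards).

    stall_cycles maps a thread id to the memory stall cycles it accumulated
    in the last sampling interval. Using stall cycles as the progress proxy
    and the median as the cutoff are assumptions, not details from the paper.
    """
    cutoff = median(stall_cycles.values())
    # Threads that stalled more than the cutoff are treated as slow (laggards).
    laggards = {tid for tid, stalls in stall_cycles.items() if stalls > cutoff}
    leaders = set(stall_cycles) - laggards
    return leaders, laggards


def choose_policy(thread_id, laggards):
    """Pick a coherence policy label for one thread (hypothetical labels).

    Laggards are assumed to be latency-critical; leaders are assumed to be
    delay-tolerant and can use the cheaper policy.
    """
    return "low_latency" if thread_id in laggards else "low_traffic"


# Example interval: threads 0 and 2 stall far more than threads 1 and 3.
stalls = {0: 1200, 1: 300, 2: 2500, 3: 450}
leaders, laggards = classify_threads(stalls)
policies = {tid: choose_policy(tid, laggards) for tid in stalls}
```

In this sketch the classification would be recomputed every sampling interval, so a thread can move between the two groups as its measured stalls change.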
