Modeling and Stack Simulation of CMP Cache Capacity and Accessibility

The performance trade-off between fast data access through local data replication and maximal cache capacity through global data sharing has been extensively studied for many-core Chip Multiprocessors (CMPs). Costly simulations over a wide spectrum of the design space are generally required to gain enough insight for a sound design. To lower this cost, we develop an abstract model for understanding the performance impact of data replication on CMP caches. Because the model cannot capture real-time interactions among multiple cores, we further develop an efficient single-pass stack simulation to study the performance of CMP cache organizations with various degrees of data replication. The global stack logically incorporates a shared stack and per-core private stacks, so that shared and private reuse (stack) distances can be collected in a single simulation pass. From these reuse distances, one can calculate the performance of CMP cache organizations with various degrees of data replication. We verify both the model and the stack simulation against execution-driven simulations with commercial multithreaded workloads. The results show that the abstract model provides accurate information about the performance trade-offs of data replication. The stack simulation predicts the performance of various cache organizations with error margins of 2-9 percent while using only about 8 percent of the simulation time.
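To make the idea of collecting reuse (stack) distances in one pass concrete, the following is a minimal sketch of classic LRU stack-distance processing applied to a toy multi-core trace. It is not the paper's global stack: the trace format `(core, block)`, the function names, and the choice to compute the shared and per-core private distances as separate passes over the same trace are all illustrative assumptions, and coherence effects such as invalidations are ignored.

```python
import math

def stack_distances(blocks):
    """LRU stack distance of each access (math.inf on the first touch).

    A distance of 1 means the block was at the top of the LRU stack.
    The linear search keeps the sketch short; it is not efficient.
    """
    stack = []       # most recently used block is at the end
    distances = []
    for block in blocks:
        if block in stack:
            pos = stack.index(block)
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(math.inf)
        stack.append(block)
    return distances

def miss_ratio(distances, capacity_blocks):
    """Miss ratio of a fully associative LRU cache holding capacity_blocks."""
    misses = sum(1 for d in distances if d > capacity_blocks)
    return misses / len(distances)

def shared_and_private_distances(trace):
    """trace: list of (core, block) pairs (hypothetical format).

    Returns the distances seen by one shared stack over all accesses and
    by one private stack per core (assumption: no coherence modeling).
    """
    shared = stack_distances([block for _, block in trace])
    cores = sorted({core for core, _ in trace})
    private = {
        core: stack_distances([b for c, b in trace if c == core])
        for core in cores
    }
    return shared, private

# Toy trace: two cores touching three blocks.
trace = [(0, "A"), (1, "B"), (0, "A"), (1, "C"), (0, "B"), (1, "B")]
shared, private = shared_and_private_distances(trace)
```

Because an access with stack distance d hits in any LRU cache of at least d blocks, the same distance list yields the miss ratio for every capacity without re-running the trace, for example `miss_ratio(shared, 4)` for a four-block shared cache.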
