Predicting inter-thread cache contention on a chip multi-processor architecture

This paper studies the impact of L2 cache sharing on threads that run simultaneously on a chip multi-processor (CMP) architecture. Cache sharing affects threads nonuniformly: some threads may be slowed down significantly while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and starvation of threads that fail to occupy enough cache space to make good progress. Unfortunately, no existing model allows extensive investigation of the impact of cache sharing. To enable such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is each thread's isolated L2 cache stack distance profile or circular sequence profile, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses each thread incurs due to cache sharing. The models differ in complexity and prediction accuracy. We validate the models against a cycle-accurate simulation of a dual-core CMP architecture, using fourteen pairs of mostly SPEC benchmarks. The most accurate model, the inductive probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, we present a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact.
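To make the models' input concrete, the following is a minimal sketch of how an LRU stack distance profile might be collected for one thread running alone, and how the isolated miss count for a given associativity follows from it. It is not one of the three prediction models described above. It assumes a single fully associative LRU set, and the names `stack_distance_profile` and `isolated_misses` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def stack_distance_profile(trace):
    """Count accesses by LRU stack distance for one cache set.

    Distance 1 means the line was the most recently used one.
    The key None counts first-time (cold) accesses, which have
    no finite stack distance.
    """
    stack = []           # index 0 holds the most recently used line
    profile = Counter()
    for line in trace:
        if line in stack:
            depth = stack.index(line) + 1   # 1-based stack position
            profile[depth] += 1
            stack.remove(line)
        else:
            profile[None] += 1
        stack.insert(0, line)               # move line to the MRU position
    return profile


def isolated_misses(profile, assoc):
    """Misses of an assoc-way LRU set: cold accesses plus reuses deeper than assoc."""
    return sum(count for dist, count in profile.items()
               if dist is None or dist > assoc)


if __name__ == "__main__":
    trace = ["A", "B", "A", "C", "B", "A"]   # illustrative trace, not measured data
    profile = stack_distance_profile(trace)
    for ways in (1, 2, 3):
        print(ways, isolated_misses(profile, ways))
```

Under these assumptions, a thread whose reuses concentrate at small stack distances loses few hits when a co-scheduled thread takes part of the cache, while a thread with many reuses near the associativity limit is more exposed. This is the kind of temporal reuse information the profile gives the models to work with.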
