Accurate prediction of the behavior of multithreaded applications in shared caches

Multicore processors are now the norm, and in many of them several cores share one or more levels of cache. The performance gain expected in theory when several cores cooperate in the parallel execution of an application can, in some cases, be reduced by a cache access bottleneck, because the data accessed by the different cores can interfere in the shared cache levels. In other cases the gain can increase thanks to greater reuse of the data loaded in the cache. This paper presents an analytical model that predicts the behavior of shared caches when executing applications parallelized at the loop level. To the best of our knowledge, this is the first analytical model that tackles the behavior of multithreaded applications on realistic shared caches without requiring profiling. The experimental results show that the model's predictions are accurate and obtained very quickly, and that the model can help a compiler or programmer choose the best parallelization strategy.
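
The two effects mentioned above, interference and reuse in a shared cache, can be illustrated with a toy simulation. The sketch below is not the analytical model presented in the paper: it is a minimal, hypothetical example that assumes a fully associative LRU cache, a strict round-robin interleaving of the accesses of two threads, and made-up cache line numbers. All names (`lru_misses`, `interleave`, `CAPACITY`) are illustrative assumptions.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count the misses of a fully associative LRU cache with `capacity` lines."""
    cache = OrderedDict()  # keys are cache line ids, ordered from LRU to MRU
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def interleave(a, b):
    """Round-robin interleaving of two equal-length per-thread traces (assumption)."""
    return [x for pair in zip(a, b) for x in pair]


CAPACITY = 4   # hypothetical cache size, in lines
PASSES = 3     # each thread sweeps its data three times

t0 = [0, 1, 2] * PASSES                 # thread 0 reuses lines 0..2
t1_disjoint = [100, 101, 102] * PASSES  # thread 1 works on different lines
t1_shared = [0, 1, 2] * PASSES          # thread 1 works on the same lines

alone = lru_misses(t0, CAPACITY)
interference = lru_misses(interleave(t0, t1_disjoint), CAPACITY)
sharing = lru_misses(interleave(t0, t1_shared), CAPACITY)

print(alone, interference, sharing)
```

Under these assumptions, one thread alone only suffers its 3 cold misses. When the two threads touch disjoint data, the combined working set of 6 lines no longer fits in the 4-line cache and every one of the 18 accesses misses. When they touch the same data, the 18 accesses cause only 3 misses in total, fewer than the 6 that two private caches would incur.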
