A Gaussian Set Sampling Model for Efficient Shared Cache Profiling on Multi-Cores

The last-level cache (LLC) has a significant impact on system performance in modern multi-core processors. However, as cache sizes grow to several megabytes and beyond, the overhead of analyzing LLC performance increases substantially. To improve the efficiency of performance analysis, we propose a set-sampling-based cache profiling model for multi-core LLCs. We first characterize the memory access distributions on the LLC using a low-overhead, stress-application-based method. The results show that these distributions can be approximated by a Gaussian distribution function. Based on this observation, we propose a Gaussian-distribution-based set sampling model that predicts program performance from a limited number of representative samples. We evaluate the model on a contemporary multi-core machine and show that 1) it precisely predicts program performance on the LLC under different contention intensities and 2) it achieves similar precision with fewer samples than widely adopted set sampling methods such as random sampling and continuous address sampling.
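To illustrate why a Gaussian approximation helps set sampling, the following is a minimal sketch and not the model proposed in the paper. It assumes that per-set access counts are roughly normally distributed, uses the standard normal-theory sample-size formula to decide how many sets to profile, and extrapolates the total from the sampled sets. The set count, mean, standard deviation, error target, and all function names are hypothetical values chosen for illustration; the per-set counts are synthetic.

```python
import math
import random
import statistics


def required_samples(mean, stdev, rel_error, z=1.96):
    """Sets needed so the sample mean is within rel_error of the true mean
    at roughly 95% confidence (z = 1.96), assuming Gaussian per-set counts."""
    return math.ceil((z * stdev / (rel_error * mean)) ** 2)


def estimate_total(per_set_counts, sample_size, rng):
    """Extrapolate the total access count from a random subset of sets."""
    sampled = rng.sample(per_set_counts, sample_size)
    return statistics.fmean(sampled) * len(per_set_counts)


rng = random.Random(0)   # fixed seed so the sketch is repeatable
NUM_SETS = 8192          # hypothetical number of LLC sets

# Synthetic per-set access counts (assumption: Gaussian, mean 1000, stdev 100).
counts = [max(0.0, rng.gauss(1000.0, 100.0)) for _ in range(NUM_SETS)]

# A small pilot sample gives rough estimates of the mean and spread.
pilot = rng.sample(counts, 32)
n = required_samples(statistics.fmean(pilot), statistics.stdev(pilot), rel_error=0.02)
n = min(n, NUM_SETS)     # never ask for more sets than exist

estimate = estimate_total(counts, n, rng)
actual = sum(counts)
rel_err = abs(estimate - actual) / actual
print(f"sampled {n} of {NUM_SETS} sets, relative error {rel_err:.4f}")
```

Under these assumptions, the number of sets to profile depends only on the ratio of spread to mean and on the target error, not on the total number of sets, which is why a small fraction of a multi-megabyte cache can suffice.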
