RDV: A New Method for Memory Phase Detection and Simulation Points Selection

Previous studies proposed several code signatures, with large vector dimensions and time-consuming profiling processes, to detect phase transitions in overall processor performance. However, an effective and efficient method to detect and exploit the phase characteristics of memory accesses is still missing. This paper proposes the Reuse Distance Vector (RDV), a new metric that is tightly coupled with cache performance, to summarize phase behavior in the memory hierarchy. Unlike other code signatures, which typically have very high dimensionality, RDVs are collected at very low dimensions, and the profiling overhead can be further reduced by our sampling technique. Based on RDVs, simulation points can be selected via clustering to reduce the time spent on cycle-accurate simulation. Using the simulation points found with RDVs, the average relative error of cache miss rate estimation is as low as 1.08%, which is 79% and 22% more accurate than basic block vectors (BBVs) and extended instruction pointer vectors (EIPVs), respectively (all compared methods use the same number of simulation points, covering 0.4% of the whole program). The average errors of memory-level parallelism (MLP) and cache miss service time estimation are only 0.9% and 1.8%, respectively.

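The selection procedure described above can be sketched roughly as follows: split the memory access trace into fixed-length intervals, build one low-dimensional reuse-distance histogram (an RDV) per interval, cluster the RDVs, and take the interval nearest each cluster centroid as a simulation point. The Python sketch below is a minimal illustration of that idea under stated assumptions, not the paper's implementation: the interval length, the log-scale distance buckets, the naive LRU stack-distance computation, the use of plain k-means in place of the paper's clustering method, and all function names are assumptions introduced here, and the paper's sampling technique is not reproduced.

import numpy as np
from sklearn.cluster import KMeans

# Assumed parameters (not taken from the paper): log-scale reuse-distance
# bucket boundaries and a fixed interval length in memory accesses.
BUCKETS = [0, 2, 8, 32, 128, 512, 2048, float("inf")]
INTERVAL = 1_000_000

def reuse_distance_vectors(addresses, interval=INTERVAL, buckets=BUCKETS):
    # Split the address trace into intervals and build one low-dimensional
    # reuse-distance histogram (RDV) per interval.
    stack = []                      # LRU stack, most recent first (naive O(n) model)
    seen = set()
    hist = np.zeros(len(buckets) - 1)
    rdvs = []
    for i, addr in enumerate(addresses):
        if addr in seen:
            # Stack distance = number of distinct addresses touched since
            # the previous access to this address.
            depth = stack.index(addr)
            stack.remove(addr)
            b = min(np.searchsorted(buckets, depth, side="right") - 1, len(hist) - 1)
            hist[b] += 1
        else:
            seen.add(addr)
        stack.insert(0, addr)
        if (i + 1) % interval == 0:  # close the current interval
            total = hist.sum()
            rdvs.append(hist / total if total else hist)
            hist = np.zeros(len(buckets) - 1)
    return np.array(rdvs)

def pick_simulation_points(rdvs, n_clusters=8):
    # Cluster the per-interval RDVs and return, for each cluster, the index of
    # the interval closest to the centroid as a representative simulation point.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(rdvs)
    points = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(rdvs[members] - km.cluster_centers_[c], axis=1)
        points.append(int(members[dists.argmin()]))
    return sorted(points)

Given a trace of memory addresses, calling pick_simulation_points(reuse_distance_vectors(trace)) would return the interval indices to hand to the cycle-accurate simulator; in practice the number of clusters (and hence simulation points) would be chosen to match the 0.4% simulation budget quoted above.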