PARDA: A Fast Parallel Reuse Distance Analysis Algorithm

Reuse distance analysis is a well-established approach to characterizing data cache locality, based on the stack histogram model. So far the analysis has been restricted to offline use because of its high cost, which is often several orders of magnitude greater than the execution time of the analyzed code. This paper presents the first parallel algorithm to compute accurate reuse distances by analysis of memory address traces. The algorithm uses a tunable parameter that enables faster analysis when the maximum needed reuse distance is limited by a cache size upper bound. Experimental evaluation using the SPEC CPU 2006 benchmark suite shows that, with 64 processors and a cache bound of 8 MB, reuse distance analysis can be performed with full accuracy at a cost of 13 to 50 times the original execution time of the benchmarks.
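
To make the quantity being measured concrete, the following is a minimal sequential sketch of reuse distance computation over an address trace. It is not the parallel algorithm the paper presents; it is the naive LRU-stack formulation, and the names `reuse_distances` and `lru_misses` are hypothetical, chosen for illustration only. The reuse distance of an access is taken here as the number of distinct addresses referenced since the previous access to the same address, with first-time accesses assigned an infinite distance.

```python
from math import inf


def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    Naive O(N * M) LRU-stack version (N accesses, M distinct addresses);
    illustrative only, not the parallel algorithm from the paper.
    """
    stack = []       # LRU stack: most recently used address is last
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Addresses above `addr` in the stack are exactly the distinct
            # addresses touched since its previous access.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(inf)  # first access: cold miss
        stack.append(addr)
    return distances


def lru_misses(distances, cache_lines):
    """Count misses in a fully associative LRU cache of `cache_lines` lines.

    An access hits only if its reuse distance is smaller than the capacity.
    """
    return sum(1 for d in distances if d >= cache_lines)


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "b"]
    dists = reuse_distances(trace)
    print(dists)
    print(lru_misses(dists, 2), lru_misses(dists, 3))
```

Because an access with distance d hits in any fully associative LRU cache holding more than d lines, distances at or above a chosen capacity bound do not need to be resolved exactly. This appears to be the kind of saving the cache size upper bound in the abstract exploits, although the sketch above does not implement that bound.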