Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?

On Chip Multiprocessors (CMP), it is common that multiple cores share certain levels of cache. The sharing increases the contention in cache and memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, reuse distance has served for a variety of locality analysis, program transformations, and performance prediction. However, previous studies have concentrated on sequential programs running on unicore processors. On CMP, accesses by different threads (or jobs) interact in the shared cache. How reuse distance applies to the new architecture remains an open question—particularly, how the interactions in shared cache affect the collection and application of reuse distance, and how reuse-distance–based locality analysis should adapt to such architecture changes. This paper presents our explorations towards answering those questions. It first introduces the concept of concurrent reuse distance, a direct extension of the traditional concept of reuse distance with data references by all co-running threads (or jobs) considered. It then discusses the properties of concurrent reuse distance, revealing the special challenges facing the collection and application of concurrent reuse distance on CMP platforms. Finally, it presents the solutions to those challenges for a class of multithreading applications. The solutions center on a probabilistic model that connects concurrent reuse distance with the data locality of each individual thread. Experiments demonstrate the effectiveness of the proposed techniques in facilitating the uses of concurrent reuse distance for CMP computing.

[1]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[2]  Erik Hagersten,et al.  Fast data-locality profiling of native execution , 2005, SIGMETRICS '05.

[3]  Mellor-CrummeyJohn,et al.  Cross-architecture performance predictions for scientific applications using parameterized models , 2004 .

[4]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[5]  Harold S. Stone,et al.  Footprints in the cache , 1986, SIGMETRICS '86/PERFORMANCE '86.

[6]  Michael Stumm,et al.  Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors , 2007, EuroSys '07.

[7]  Tor M. Aamodt,et al.  A first-order fine-grained multithreaded throughput model , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[8]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[9]  Srihari Makineni,et al.  Communist, Utilitarian, and Capitalist cache policies on CMPs: Caches as a shared resource , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[10]  Chen Ding,et al.  Locality approximation using time , 2007, POPL '07.

[11]  Chen Ding,et al.  All-window profiling of concurrent executions , 2008, PPoPP.

[12]  Steve Carr,et al.  Feedback-directed memory disambiguation through store distance analysis , 2006, ICS '06.

[13]  Dean M. Tullsen,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[14]  Yutao Zhong,et al.  Predicting whole-program locality through reuse distance analysis , 2003, PLDI '03.

[15]  Alan Jay Smith,et al.  On the effectiveness of set associative page mapping and its application to main memory management , 1976, ICSE '76.

[16]  Wentao Chang,et al.  Sampling-based program locality approximation , 2008, ISMM '08.

[17]  John M. Mellor-Crummey,et al.  Cross-architecture performance predictions for scientific applications using parameterized models , 2004, SIGMETRICS '04/Performance '04.

[18]  Kristof Beyls,et al.  Reuse Distance as a Metric for Cache Behavior. , 2001 .

[19]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[20]  G. Edward Suh,et al.  Analytical cache models with applications to cache partitioning , 2001, ICS '01.

[21]  Xipeng Shen,et al.  Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? , 2010, PPoPP '10.

[22]  M TullsenDean,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000 .

[23]  Chen Ding,et al.  Regression-Based Multi-Model Prediction of Data Reuse Signature , 2003 .

[24]  Chen Ding,et al.  Miss Rate Prediction Across Program Inputs and Cache Configurations , 2007, IEEE Transactions on Computers.

[25]  Peter J. Denning,et al.  Thrashing: its causes and prevention , 1968, AFIPS Fall Joint Computing Conference.

[26]  Kristof Beyls,et al.  Discovery of Locality-Improving Refactorings by Reuse Path Analysis , 2006, HPCC.

[27]  BergErik,et al.  Fast data-locality profiling of native execution , 2005 .

[28]  Xipeng Shen,et al.  Scalable Implementation of Efficient Locality Approximation , 2008, LCPC.

[29]  Won-Taek Lim,et al.  Architectural support for operating system-driven CMP cache management , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[30]  Michael D. Smith,et al.  Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[31]  Chen Ding,et al.  Program locality analysis using reuse distance , 2009, TOPL.

[32]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[33]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[34]  Steve Carr,et al.  Instruction based memory distance analysis and its application to optimization , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[35]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[36]  Chen Ding,et al.  Array regrouping and structure splitting using whole-program reference affinity , 2004, PLDI '04.

[37]  Zhao Zhang,et al.  Soft-OLP: Improving Hardware Cache Performance through Software-Controlled Object-Level Partitioning , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[38]  Alex Settle,et al.  Architectural Support for Enhanced SMT Job Scheduling , 2004, IEEE PACT.

[39]  Chen Ding,et al.  Miss rate prediction across all program inputs , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[40]  George Ho,et al.  PAPI: A Portable Interface to Hardware Performance Counters , 1999 .

[41]  Barbara M. Chapman,et al.  Evaluating OpenMP on Chip MultiThreading Platforms , 2005, IWOMP.