Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis

Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally study these interactions through extensive simulation, which can be costly and offers limited insight. An alternative is multicore reuse distance (RD) analysis, which can provide rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.
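
As background for readers unfamiliar with the metric, the following is a minimal single-threaded sketch of reuse distance, assuming the standard definition: the number of distinct addresses referenced between two consecutive accesses to the same address. It is not the multicore analysis or the framework proposed in the paper, and it does not model shared-cache interference or private-cache replication. The function names `reuse_distances` and `lru_misses` and the example trace are hypothetical, chosen only for illustration.

```python
def reuse_distances(trace):
    """Return the reuse distance of every reference in `trace`.

    The distance is the number of distinct addresses touched since the
    previous access to the same address. First-time (cold) references
    have no previous access and are reported as None.
    """
    stack = []  # LRU stack: most recently used address is at the end
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Everything above `addr` in the stack was touched since its last use.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def lru_misses(trace, capacity):
    """Count misses in a fully associative LRU cache of `capacity` blocks.

    A reference misses if it is cold or if its reuse distance is at least
    the cache capacity. One pass over the trace therefore gives the miss
    count for any capacity, which is what makes the metric attractive
    compared with running one simulation per cache size.
    """
    return sum(1 for d in reuse_distances(trace) if d is None or d >= capacity)


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "b"]  # hypothetical address trace
    print(reuse_distances(trace))
    for capacity in (1, 2, 3):
        print(capacity, lru_misses(trace, capacity))
```

The list-based stack keeps the sketch short but costs time linear in the number of distinct addresses per reference; practical tools typically use tree-based structures instead.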
