Modeling Shared Cache Performance of OpenMP Programs using Reuse Distance

Performance modeling of parallel applications on multicore computers remains a challenge in computational co-design because multicore processors have complex designs that include both private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model to predict the performance of parallel applications that run on a multicore computer and share the same level of cache in the hierarchy. The model uses a computationally efficient, probabilistic method to predict reuse distance profiles, where reuse distance is a hardware architecture-independent measure of the patterns of virtual memory accesses. It relies on a stochastic, static basic block-level analysis of reuse profiles measured from the memory traces of applications run sequentially on small instances, rather than on a multi-threaded trace. The results indicate that the hit-rate predictions on the shared cache are accurate.
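
To make the notion of a reuse distance profile concrete, the following is a minimal Python sketch. It is not the model described above: it measures reuse distances exactly from a small single-threaded trace using an LRU stack, and it estimates the hit rate under the simplifying assumption of a fully associative LRU cache. The function names (`reuse_distances`, `reuse_profile`, `lru_hit_rate`) and the toy trace are hypothetical and chosen only for illustration.

```python
from collections import Counter


def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    The reuse distance is the number of distinct addresses touched
    since the previous access to the same address. First-time
    accesses have no previous use and are reported as None.
    """
    stack = []  # LRU stack: most recently used address at the end
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Every entry above `pos` is a distinct address seen since
            # the last use of `addr`.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def reuse_profile(trace):
    """Histogram mapping reuse distance to number of accesses."""
    return Counter(reuse_distances(trace))


def lru_hit_rate(profile, cache_lines):
    """Hit rate of a fully associative LRU cache with `cache_lines` lines.

    Assumption: an access hits exactly when its reuse distance is
    smaller than the number of cache lines; cold accesses always miss.
    """
    total = sum(profile.values())
    hits = sum(n for d, n in profile.items()
               if d is not None and d < cache_lines)
    return hits / total if total else 0.0


# Toy trace of cache-line addresses (hypothetical data).
trace = ["a", "b", "c", "a", "b", "b", "d", "a"]
profile = reuse_profile(trace)
print(dict(profile))
print(lru_hit_rate(profile, cache_lines=3))
```

Because the profile depends only on the access pattern, the same histogram can be reused to estimate hit rates for different cache sizes without re-reading the trace, which is what makes reuse distance a hardware architecture-independent measure.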
