PPT-Multicore: performance prediction of OpenMP applications using reuse profiles and analytical modeling

We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) that predicts the performance of parallel applications running on a multicore processor. PPT-Multicore builds upon our previous work on a multicore cache model. Using an architecture-independent, LLVM-based instrumentation tool, we extract a memory trace labeled with LLVM basic blocks; this extraction is needed only once in an application’s lifetime. The model takes as input the memory trace and other parameters obtained from an instrumented, sequentially executed binary. We use probabilistic, computationally efficient reuse profiles to predict the cache hit rates and runtimes of the parallel sections of OpenMP programs. We model Intel’s Broadwell and Haswell architectures and AMD’s Zen2 architecture, and we validate our framework using applications from the PolyBench and PARSEC benchmark suites. The results show that PPT-Multicore predicts cache hit rates with an overall average error of 1.23% and runtimes with an average error of 9.08%.
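To illustrate what a reuse profile is, the following minimal sketch builds one from a short memory trace and derives a hit rate from it. It is not the PPT-Multicore model: it uses the classic exact stack-distance rule for a fully associative LRU cache rather than a probabilistic estimate, and the function names `reuse_profile` and `lru_hit_rate`, as well as the toy trace, are hypothetical choices made for this example.

```python
from collections import Counter


def reuse_profile(trace):
    """Histogram of reuse distances for a sequence of addresses.

    The reuse distance of an access is the number of distinct addresses
    touched since the previous access to the same address. First-time
    (cold) accesses are recorded under the key -1.
    """
    stack = []  # LRU stack: most recently used address is at the end
    profile = Counter()
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            distance = len(stack) - 1 - idx
            stack.pop(idx)
        else:
            distance = -1  # cold access, no earlier use
        profile[distance] += 1
        stack.append(addr)
    return profile


def lru_hit_rate(profile, cache_lines):
    """Hit rate of a fully associative LRU cache holding `cache_lines` lines.

    An access hits exactly when its reuse distance is smaller than the
    number of lines; cold accesses always miss.
    """
    total = sum(profile.values())
    hits = sum(n for d, n in profile.items() if 0 <= d < cache_lines)
    return hits / total if total else 0.0


if __name__ == "__main__":
    toy_trace = list("abcabda")  # hypothetical trace of cache-line labels
    profile = reuse_profile(toy_trace)
    for lines in (2, 3):
        print(lines, lru_hit_rate(profile, lines))
```

Because the profile is computed once and then queried for any cache size, the same trace can be reused to evaluate several cache configurations, which is the property that makes reuse profiles attractive for this kind of prediction. The list-based stack keeps the sketch short but costs linear time per access; a practical tool would use a tree-based or approximate method instead.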
