PPT-Multicore: performance prediction of OpenMP applications using reuse profiles and analytical modeling
Gopinath Chennupati | Nandakishore Santhi | Stephan Eidenbenz | Abdel-Hameed Badawy | Yehia Arafa | Atanu Barai
[1] Lieven Eeckhout,et al. Computer Architecture Performance Evaluation Methods , 2010, Computer Architecture Performance Evaluation Methods.
[2] Donald Yeung,et al. Identifying Power-Efficient Multicore Cache Hierarchies via Reuse Distance Analysis , 2016, TOCS.
[3] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.
[4] David A. Wood,et al. Reuse-based online models for caches , 2013, SIGMETRICS '13.
[5] Kunle Olukotun,et al. Maximizing CMP throughput with mediocre cores , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).
[6] Shunfei Chen,et al. MARSS: A full system simulator for multicore x86 CPUs , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).
[7] Kapil Vaswani,et al. Construction and use of linear regression models for processor performance analysis , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006.
[8] Mark D. Hill,et al. Amdahl's Law in the Multicore Era , 2008, Computer.
[9] Zhen Yang,et al. Modeling and Stack Simulation of CMP Cache Capacity and Accessibility , 2009, IEEE Transactions on Parallel and Distributed Systems.
[10] Irving L. Traiger,et al. Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..
[11] Sally A. McKee,et al. Efficiently exploring architectural design spaces via predictive modeling , 2006, ASPLOS XII.
[12] Donald Yeung,et al. Guiding Locality Optimizations for Graph Computations via Reuse Distance Analysis , 2017, IEEE Computer Architecture Letters.
[13] Chen Ding,et al. Program locality analysis using reuse distance , 2009, TOPL.
[14] Chen Ding,et al. Locality approximation using time , 2007, POPL '07.
[15] Jeffrey S. Vetter,et al. Aspen: A domain specific language for performance modeling , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[16] Scott Pakin,et al. Hardware-independent application characterization , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).
[17] Chris J. Scheiman,et al. LogGP: Incorporating Long Messages into the LogP Model for Parallel Computation , 1997, J. Parallel Distributed Comput..
[18] Donald Yeung,et al. Coherent Profiles: Enabling Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[19] Donald Yeung,et al. Using Multicore Reuse Distance to Study Coherence Directories , 2017, ACM Trans. Comput. Syst..
[20] Donald Yeung,et al. Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis , 2012, MSPC '12.
[21] Prasanna Balaprakash,et al. Benchmarking Machine Learning Methods for Performance Modeling of Scientific Applications , 2018, 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).
[22] Xingfu Wu,et al. Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore supercomputers , 2011, J. Comput. Syst. Sci..
[23] Cong Xu,et al. Moguls: A model to explore the memory hierarchy for bandwidth improvements , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[24] Gopinath Chennupati,et al. GPUs Cache Performance Estimation using Reuse Distance Analysis , 2019, 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC).
[25] Christian Bienia,et al. Benchmarking modern multiprocessors , 2011 .
[26] Lifan Xu,et al. Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).
[27] Gopinath Chennupati,et al. Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles , 2020, ICS.
[28] Gopinath Chennupati,et al. Parallel Application Performance Prediction Using Analysis Based Models and HPC Simulations , 2018, SIGSIM-PADS.
[29] Stefanos Kaxiras,et al. Coherence communication prediction in shared-memory multiprocessors , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).
[30] Nicholas Nethercote,et al. Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI.
[31] Chen Ding,et al. Miss Rate Prediction Across Program Inputs and Cache Configurations , 2007, IEEE Transactions on Computers.
[32] T.G. Venkatesh,et al. Analytical Derivation of Concurrent Reuse Distance Profile for Multi-Threaded Application Running on Chip Multi-Processor , 2019, IEEE Transactions on Parallel and Distributed Systems.
[33] William J. Dally,et al. Reuse Distance-Based Probabilistic Cache Replacement , 2015, ACM Trans. Archit. Code Optim..
[34] Todd M. Austin,et al. SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.
[35] Hao Luo,et al. Performance Metrics and Models for Shared Cache , 2014, Journal of Computer Science and Technology.
[36] Michael F. P. O'Boyle,et al. Microarchitectural Design Space Exploration Using an Architecture-Centric Approach , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[37] David A. Padua,et al. Estimating cache misses and locality using stack distances , 2003, ICS '03.
[38] Donald Yeung,et al. Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs , 2013 .
[39] Gopinath Chennupati,et al. Scalable Performance Prediction of Codes with Memory Hierarchy and Pipelines , 2019, SIGSIM-PADS.
[40] Thomas R. Gross,et al. Lightweight Memory Tracing , 2013, USENIX Annual Technical Conference.
[41] Stijn Eyerman,et al. An Evaluation of High-Level Mechanistic Core Models , 2014, ACM Trans. Archit. Code Optim..
[42] David Defour,et al. Barra: A Parallel Functional Simulator for GPGPU , 2010, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.
[43] P. Sadayappan,et al. PARDA: A Fast Parallel Reuse Distance Analysis Algorithm , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[44] Ramesh Subramonian,et al. LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.
[45] Kristof Beyls,et al. Reuse Distance as a Metric for Cache Behavior. , 2001 .
[46] Erik Hagersten,et al. StatCache: a probabilistic approach to efficient and accurate data locality analysis , 2004, IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004.
[47] Nicholas Nethercote,et al. Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.
[48] Paul D. Gader,et al. Image algebra techniques for parallel image processing , 1987 .
[49] Milind Kulkarni,et al. Accelerating multicore reuse distance analysis with sampling and parallelization , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[50] Stefanos Kaxiras,et al. Cache replacement based on reuse-distance prediction , 2007, 2007 25th International Conference on Computer Design.
[51] Zhe Wang,et al. Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[52] Bronis R. de Supinski,et al. A ROSE-Based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries , 2010, IWOMP.
[53] Mateo Valero,et al. Improving Cache Management Policies Using Dynamic Reuse Distances , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[54] John Shalf,et al. Exascale Computing Technology Challenges , 2010, VECPAR.
[55] Vijay Janapa Reddi,et al. PIN: a binary instrumentation tool for computer architecture research and education , 2004, WCAE '04.
[56] Christopher J. Hughes,et al. RSIM: Simulating Shared-Memory Multiprocessors with ILP Processors , 2002, Computer.
[57] Takahiro Katagiri,et al. The SimCore/Alpha Functional Simulator , 2004, WCAE '04.
[58] Xipeng Shen,et al. Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? , 2010, CC.
[59] David M. Brooks,et al. CPR: Composable performance regression for scalable multiprocessor models , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.
[60] David Black-Schaffer,et al. Analytical Processor Performance and Power Modeling Using Micro-Architecture Independent Characteristics , 2016, IEEE Transactions on Computers.
[61] Misbah Mubarak,et al. Durango: Scalable Synthetic Workload Generation for Extreme-Scale Application Performance Modeling and Simulation , 2017, SIGSIM-PADS.
[62] Mark D. Hill,et al. Amdahl's Law in the Multicore Era , 2008 .
[63] Donald Yeung,et al. Studying multicore processor scaling via reuse distance analysis , 2013, ISCA.
[64] Satyajayant Misra,et al. A Scalable Analytical Memory Model for CPU Performance Prediction , 2017, PMBS@SC.
[65] Derek L. Schuff,et al. Multicore-aware reuse distance analysis , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[66] Jack J. Dongarra,et al. Collecting Performance Data with PAPI-C , 2009, Parallel Tools Workshop.
[67] Gopinath Chennupati,et al. PPT-GPU: Scalable GPU Performance Modeling , 2019, IEEE Computer Architecture Letters.
[68] Jaehyuk Huh,et al. Exploring the design space of future CMPs , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.
[69] Gopinath Chennupati,et al. An analytical memory hierarchy model for performance prediction , 2017, 2017 Winter Simulation Conference (WSC).
[70] Thomas F. Wenisch,et al. SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture , 2004, PERV.
[71] Abdel-Hameed A. Badawy,et al. PPT-SASMM: Scalable Analytical Shared Memory Model: Predicting the Performance of Multicore Caches from a Single-Threaded Execution Trace , 2020, MEMSYS.
[72] Donald Yeung,et al. Optimizing locality in graph computations using reuse distance profiles , 2017, 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC).
[73] Samuel Williams,et al. ExaSAT: An exascale co-design tool for performance modeling , 2015, Int. J. High Perform. Comput. Appl..
[74] Yutao Zhong,et al. Predicting whole-program locality through reuse distance analysis , 2003, PLDI.
[75] Chen Ding,et al. Reuse Distance Analysis , 2001 .
[76] Bruce Jacob,et al. The structural simulation toolkit , 2006, PERV.
[77] Collin McCurdy,et al. Using Pin as a memory reference generator for multiprocessor simulation , 2005, CARN.
[78] Roland N. Ibbett,et al. HASE: A Flexible Toolset for Computer Architects , 1995, Comput. J..
[79] Nandakishore Santhi,et al. The Simian concept: Parallel Discrete Event Simulation with interpreted languages and just-in-time compilation , 2015, 2015 Winter Simulation Conference (WSC).
[80] Daniel Aarno,et al. Software and System Development using Virtual Platforms: Full-System Simulation with Wind River Simics , 2014 .
[81] Seyong Lee,et al. COMPASS: A Framework for Automated Performance Modeling and Prediction , 2015, ICS.
[82] Chen Ding,et al. A Composable Model for Analyzing Locality of Multi-threaded Programs , 2009 .
[83] L. Dagum,et al. OpenMP: an industry standard API for shared-memory programming , 1998 .
[84] David Black-Schaffer,et al. Formalizing Data Locality in Task Parallel Applications , 2016, ICA3PP Workshops.
[85] Lieven Eeckhout,et al. Modeling Superscalar Processor Memory-Level Parallelism , 2018, IEEE Computer Architecture Letters.
[86] Robert B. Ross,et al. CODES: Enabling Co-Design of Multi-Layer Exascale Storage Architectures , 2011 .