Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles
Gopinath Chennupati | Stephan Eidenbenz | Abdel-Hameed A. Badawy | Yehia Arafa | Nandakishore Santhi | Atanu Barai
[1] John Kim, et al. Improving GPGPU resource utilization through alternative thread block scheduling, 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[2] Gopinath Chennupati, et al. An analytical memory hierarchy model for performance prediction, 2017, 2017 Winter Simulation Conference (WSC).
[3] Emmett Kilgariff, et al. Fermi GF100 GPU Architecture, 2011, IEEE Micro.
[4] Aristides Efthymiou, et al. Synthetic Trace-Driven Simulation of Cache Memory, 2007, 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07).
[5] Yutao Zhong, et al. Predicting whole-program locality through reuse distance analysis, 2003, PLDI.
[6] Gopinath Chennupati, et al. Scalable Performance Prediction of Codes with Memory Hierarchy and Pipelines, 2019, SIGSIM-PADS.
[7] Sriram Krishnamoorthy, et al. Cache miss characterization and data locality optimization for imperfectly nested loops on shared memory multiprocessors, 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[8] Amir Rajabzadeh, et al. Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis, 2018, ACM Trans. Archit. Code Optim.
[9] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[10] Shuaiwen Song, et al. Locality-Driven Dynamic GPU Cache Bypassing, 2015, ICS.
[11] Tao Tang, et al. Cache Miss Analysis for GPU Programs Based on Stack Distance Profile, 2011, 2011 31st International Conference on Distributed Computing Systems.
[12] C. Cascaval, et al. Calculating stack distances efficiently, 2003, MSP '02.
[13] A. Azzouz. 2011, 2020, City.
[14] Donald Yeung, et al. Studying multicore processor scaling via reuse distance analysis, 2013, ISCA.
[15] Gopinath Chennupati, et al. Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs, 2019, 2019 IEEE High Performance Extreme Computing Conference (HPEC).
[16] Mahmut T. Kandemir, et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs, 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[17] Sharon L. Wolchik. 1989, 2009.
[18] Mark Horowitz, et al. An analytical cache model, 1989, TOCS.
[19] Brad Calder, et al. Automatically characterizing large scale program behavior, 2002, ASPLOS X.
[20] Gopinath Chennupati, et al. PPT-GPU: Scalable GPU Performance Modeling, 2019, IEEE Computer Architecture Letters.
[21] C. Martin. 2015, 2015, Les 25 ans de l’OMC: Une rétrospective en photos.
[22] Henk Corporaal, et al. A detailed GPU cache model based on reuse distance theory, 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[23] Lieven Eeckhout, et al. Performance analysis through synthetic trace generation, 2000, 2000 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS (Cat. No.00EX422).
[24] Lifan Xu, et al. Auto-tuning a high-level language targeted to GPU codes, 2012, 2012 Innovative Parallel Computing (InPar).
[25] Satyajayant Misra, et al. A Scalable Analytical Memory Model for CPU Performance Prediction, 2017, PMBS@SC.
[26] Siddhartha Chatterjee, et al. Exact analysis of the cache behavior of nested loops, 2001, PLDI '01.
[27] Donald Yeung, et al. Guiding Locality Optimizations for Graph Computations via Reuse Distance Analysis, 2017, IEEE Computer Architecture Letters.
[28] Harish Patil, et al. Pin: building customized program analysis tools with dynamic instrumentation, 2005, PLDI '05.
[29] Dongwei Wang, et al. A reuse distance based performance analysis on GPU L1 data cache, 2016, 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC).
[30] Gopinath Chennupati, et al. Verified instruction-level energy consumption measurement for NVIDIA GPUs, 2020, CF.
[31] Sudhakar Yalamanchili, et al. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[32] T. G. Venkatesh, et al. Analytical Miss Rate Calculation of L2 Cache from the RD Profile of L1 Cache, 2018, IEEE Transactions on Computers.
[33] Maged M. Michael, et al. Accuracy and speed-up of parallel trace-driven architectural simulation, 1997, Proceedings 11th International Parallel Processing Symposium.
[34] Donald Yeung, et al. Optimizing locality in graph computations using reuse distance profiles, 2017, 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC).
[35] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator, 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[36] David W. Nellans, et al. Flexible software profiling of GPU architectures, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[37] Chen Ding, et al. Miss rate prediction across all program inputs, 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.
[38] Yun Liang, et al. An efficient compiler framework for cache bypassing on GPUs, 2013, 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[39] Steve Carr, et al. Reuse-distance-based miss-rate prediction on a per instruction basis, 2004, MSP '04.
[40] Nicholas Nethercote, et al. Valgrind: a framework for heavyweight dynamic binary instrumentation, 2007, PLDI '07.
[41] Erich Strohmaier, et al. Quantifying Locality In The Memory Access Patterns of HPC Applications, 2005, ACM/IEEE SC 2005 Conference (SC'05).
[42] Mahmut T. Kandemir, et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance, 2013, ASPLOS '13.
[43] Hyeran Jeon, et al. Detailed Characterization of Deep Neural Networks on GPUs and FPGAs, 2019, GPGPU@ASPLOS.
[44] Oreste Villa, et al. NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs, 2019, MICRO.
[45] Gopinath Chennupati, et al. GPUs Cache Performance Estimation using Reuse Distance Analysis, 2019, 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC).
[46] Krishna M. Kavi, et al. Gleipnir: a memory profiling and tracing tool, 2013, CARN.