Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs
Gopinath Chennupati | Stephan Eidenbenz | Abdel-Hameed A. Badawy | Yehia Arafa | Nandakishore Santhi