Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors
Wei Sun, Ang Li, Tong Geng, Sander Stuijk, Henk Corporaal