GPU Tensor Cores for Fast Arithmetic Reductions
Ricardo J. Barrientos, Cristóbal A. Navarro, Raimundo Vega, Roberto Carrasco, Javier A. Riquelme
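The title refers to performing parallel sums with the matrix-multiply-accumulate (MMA) units on recent NVIDIA GPUs. As a minimal sketch of the general idea (not the authors' released code), the CUDA kernel below sums a 16x16 half-precision tile by multiplying it with an all-ones matrix through the WMMA API: every row of the accumulator then holds the tile's column sums, and a short scalar loop finishes the reduction. The kernel name, launch shape, and tile size are illustrative assumptions; compiling requires a tensor-core-capable GPU (compute capability 7.0 or newer, e.g. nvcc -arch=sm_70).

```cuda
// Minimal sketch: one warp reduces a 16x16 half-precision tile with a
// single tensor-core MMA. Computing ones * A leaves the column sums of A
// replicated in every row of the accumulator; a scalar loop adds the 16
// column sums. Names (tileReduce, A, out) are illustrative assumptions.
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void tileReduce(const half *A, float *out) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(ones, __float2half(1.0f)); // left operand: all ones
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);              // load the 16x16 tile
    wmma::mma_sync(acc, ones, a, acc);             // each acc row = column sums of A

    __shared__ float sums[16 * 16];
    wmma::store_matrix_sync(sums, acc, 16, wmma::mem_row_major);
    __syncwarp();                                  // make the stores visible to lane 0

    if (threadIdx.x == 0) {                        // row 0 holds the 16 column sums
        float total = 0.0f;
        for (int j = 0; j < 16; ++j) total += sums[j];
        *out = total;
    }
}

int main() {
    half *dA; float *dOut;
    cudaMalloc(&dA, 256 * sizeof(half));
    cudaMalloc(&dOut, sizeof(float));

    half hA[256];
    for (int i = 0; i < 256; ++i) hA[i] = __float2half(1.0f); // expected sum: 256
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

    tileReduce<<<1, 32>>>(dA, dOut);               // one warp drives the WMMA ops
    float result;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", result);                  // prints: sum = 256.000000

    cudaFree(dA); cudaFree(dOut);
    return 0;
}
```

Larger inputs are typically handled by chaining such MMAs (each warp reduces many tiles, then the per-warp partials are reduced again), which is the kind of scheme tensor-core reduction work explores.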