Shaohuai Shi | Qiang Wang | Xiaowen Chu
[1] Jack Dongarra,et al. Templates for the Solution of Algebraic Eigenvalue Problems , 2000, Software, environments, tools.
[2] Alexander Tiskin,et al. All-Pairs Shortest Paths Computation in the BSP Model , 2001, ICALP.
[3] Cache Oblivious Dense and Sparse Matrix Multiplication Based on Peano Curves , 2008.
[4] J. Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[6] Michael Garland,et al. Implementing sparse matrix-vector multiplication on throughput-oriented processors , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[7] Xiaowen Chu,et al. Practical Random Linear Network Coding on GPUs , 2009, Networking.
[8] Jack J. Dongarra,et al. An Improved Magma Gemm For Fermi Graphics Processing Units , 2010, Int. J. High Perform. Comput. Appl.
[9] Riko Jacob,et al. The I/O Complexity of Sparse Matrix Dense Matrix Multiplication , 2010, LATIN.
[10] E. M. Garzón,et al. A matrix approach to tomographic reconstruction and its implementation on GPUs , 2010, Journal of structural biology.
[11] Ki-Hwan Kim,et al. Performance analysis and optimization of three-dimensional FDTD on GPU using roofline model , 2011, Comput. Phys. Commun.
[12] Tomoya Sakai,et al. Multi-level Optimization of Matrix Multiplication for GPU-equipped Systems , 2011, ICCS.
[13] Timothy A. Davis,et al. The university of Florida sparse matrix collection , 2011, TOMS.
[14] Bertil Schmidt,et al. The Sliced COO Format for Sparse Matrix-Vector Multiplication on CUDA-enabled GPUs , 2012, ICCS.
[15] Jack J. Dongarra,et al. Autotuning GEMM Kernels for the Fermi GPU , 2012, IEEE Transactions on Parallel and Distributed Systems.
[16] André Seznec,et al. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[17] Joseph L. Greathouse,et al. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] Xinxin Mei,et al. Benchmarking the Memory Hierarchy of Modern GPUs , 2014, NPC.
[19] Francisco Vázquez,et al. FastSpMM: An Efficient Library for Sparse Matrix Matrix Product on GPUs , 2014, Comput. J.
[20] Hassan Foroosh,et al. Sparse Convolutional Neural Networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Jack J. Dongarra,et al. Performance, Design, and Autotuning of Batched GEMM for GPUs , 2016, ISC.
[22] Leonid Oliker,et al. Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[23] Michael Garland,et al. Merge-Based Parallel Sparse Matrix-Vector Multiplication , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[24] Wu-chun Feng,et al. Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[25] Erich Elsen,et al. Exploring Sparsity in Recurrent Neural Networks , 2017, ICLR.
[26] Shaohuai Shi,et al. Speeding up Convolutional Neural Networks By Exploiting the Sparsity of Rectifier Units , 2017, ArXiv.
[27] Yannis Cotronis,et al. A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling , 2017, J. Parallel Distributed Comput.
[28] Xinxin Mei,et al. Dissecting GPU Memory Hierarchy Through Microbenchmarking , 2015, IEEE Transactions on Parallel and Distributed Systems.
[29] Mingyu Chen,et al. Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning , 2017, PPoPP.
[30] Xu Sun,et al. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting , 2017, ICML.
[31] John D. Owens,et al. Design Principles for Sparse Matrix Multiplication on the GPU , 2018, Euro-Par.
[32] Xiaowen Chu,et al. G-CRS: GPU Accelerated Cauchy Reed-Solomon Coding , 2018, IEEE Transactions on Parallel and Distributed Systems.
[33] Georgios Georgiadis,et al. Accelerating Convolutional Neural Networks via Activation Map Compression , 2018, CVPR 2019.
[34] Fang Liu,et al. Learning Intrinsic Sparse Structures within Long Short-term Memory , 2017, ICLR.
[35] Shaohuai Shi,et al. A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks , 2019, 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS).
[36] Georgios Georgiadis,et al. Accelerating Convolutional Neural Networks via Activation Map Compression , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[37] Gagan Agrawal,et al. A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs , 2020, PPoPP.
[38] Xu Sun,et al. Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method , 2017, IEEE Transactions on Knowledge and Data Engineering.
[39] Erich Elsen,et al. Sparse GPU Kernels for Deep Learning , 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[40] Olfa Hamdi-Larbi,et al. Performance Evaluation of Algorithms for Sparse-Dense Matrix Product , 2020 .
[41] Erich Elsen,et al. Fast Sparse ConvNets , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Martin Winter,et al. spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis , 2020, PPoPP.
[43] Xiaowen Chu,et al. Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply , 2020, 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[44] S. Ezouaoui,et al. Performance Evaluation of Algorithms for Sparse-Dense Matrix Product.