Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor