Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor

We present a systematic methodology for optimizing batched matrix multiplications on the SW26010 many-core processor of the Sunway TaihuLight supercomputer. To fully exploit the computing capability of the SW26010 and to cope with the diverse algorithmic characteristics of batched matrix multiplications, we propose five surrogate algorithms together with a machine learning–based algorithm selector. Experimental results show that the selector adaptively chooses an appropriate algorithm for various matrix shapes and batch sizes with low overhead and high accuracy. In particular, the optimized batched matrix multiplications substantially outperform the non-batched version and reach around 84.8% of the performance upper bound.
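To make the problem setting concrete, the following is a minimal reference sketch (not the paper's optimized implementation) of a batched matrix multiplication: many independent small GEMMs, one per batch entry, sharing the same shapes. The function name `batched_gemm` and the chosen shapes are illustrative assumptions, not part of the paper.

```python
import numpy as np

def batched_gemm(A, B):
    """Naive reference for batched GEMM.

    A has shape (batch, m, k) and B has shape (batch, k, n);
    each pair A[i], B[i] is multiplied independently, yielding
    C with shape (batch, m, n). Optimized implementations (such
    as the ones the paper develops for SW26010) exploit the fact
    that all batch entries share the same small shape.
    """
    batch, m, k = A.shape
    _, _, n = B.shape
    C = np.empty((batch, m, n), dtype=A.dtype)
    for i in range(batch):  # one small GEMM per batch entry
        C[i] = A[i] @ B[i]
    return C

# Example: a batch of 64 independent 8x8 multiplications.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 8, 8))
B = rng.standard_normal((64, 8, 8))
C = batched_gemm(A, B)

# Cross-check against an einsum formulation of the same contraction.
assert C.shape == (64, 8, 8)
assert np.allclose(C, np.einsum("bij,bjk->bik", A, B))
```

A naive loop like this leaves most of a many-core processor idle for small matrices, which is why batched kernels that amortize scheduling and data-movement costs across the batch can substantially outperform calling a non-batched GEMM per matrix.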
