WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program

In this paper, we present WACO, a novel method for co-optimizing the format and the schedule of a sparse tensor program for a given sparsity pattern. A core challenge is designing a lightweight cost model that accurately predicts the runtime of a sparse tensor program by considering the sparsity pattern, the format, and the schedule together. Our key ideas are to use a sparse convolutional network to learn meaningful features of the sparsity pattern and to capture the coupled behavior of the format and the schedule with a specially designed schedule template. In addition, within the enormous co-optimization search space, our novel search strategy, an approximate nearest neighbor search, efficiently and accurately retrieves the best format and schedule for a given sparsity pattern. We evaluated WACO on four algorithms (SpMV, SpMM, SDDMM, and MTTKRP) on a CPU using 726 different sparsity patterns. Our experimental results showed that WACO outperformed four state-of-the-art baselines: Intel MKL, BestFormat, TACO with a default schedule, and ASpT. Compared to the best of the four baselines, WACO achieved 1.43×, 1.18×, 1.14×, and 1.27× average speedups on SpMV, SpMM, SDDMM, and MTTKRP, respectively.
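To make the retrieval step concrete, the sketch below illustrates the general shape of a nearest-neighbor lookup over embeddings of (format, schedule) candidates. It is a minimal illustration under stated assumptions, not WACO's actual implementation: `extract_pattern_features` is a hypothetical stand-in for the paper's sparse convolutional feature extractor, the candidate table is invented, and a brute-force scan replaces the approximate nearest neighbor index that WACO would use at scale.

```python
# Minimal sketch of retrieving a (format, schedule) pair for a sparsity
# pattern via nearest-neighbor search in a shared embedding space.
# All names here (extract_pattern_features, CANDIDATES) are hypothetical;
# WACO itself learns these embeddings with a sparse convolutional network
# and queries an approximate nearest neighbor index, not this exact scan.
import numpy as np

rng = np.random.default_rng(0)

def extract_pattern_features(dense_pattern: np.ndarray) -> np.ndarray:
    """Stand-in for the learned feature extractor: pool simple density
    statistics of the pattern into a fixed-length vector."""
    nnz_per_row = (dense_pattern != 0).sum(axis=1)
    return np.array([
        dense_pattern.shape[0],   # number of rows
        dense_pattern.shape[1],   # number of columns
        nnz_per_row.mean(),       # mean nonzeros per row
        nnz_per_row.std(),        # spread of row densities
    ], dtype=np.float32)

# Invented placeholder embeddings of format/schedule candidates; in the
# real system these would come from the trained cost model.
CANDIDATES = [
    ("CSR  + row-split schedule", rng.normal(size=4).astype(np.float32)),
    ("CSC  + col-split schedule", rng.normal(size=4).astype(np.float32)),
    ("BCSR + tiled schedule",     rng.normal(size=4).astype(np.float32)),
]

def retrieve_best(pattern: np.ndarray) -> str:
    """Return the candidate whose embedding is nearest to the pattern's
    feature vector (an ANN index would replace this loop at scale)."""
    query = extract_pattern_features(pattern)
    dists = [np.linalg.norm(query - emb) for _, emb in CANDIDATES]
    return CANDIDATES[int(np.argmin(dists))][0]

# Example: a random 64x64 pattern with ~5% density.
A = (rng.random((64, 64)) < 0.05).astype(np.float32)
print(retrieve_best(A))
```

The design point the sketch captures is that, once patterns and candidates live in one embedding space, choosing a format and schedule reduces to a similarity query rather than an exhaustive autotuning run over the co-optimization space.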
