WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program

In this paper, we present WACO, a novel method for co-optimizing the format and the schedule of a sparse tensor program for a given sparsity pattern. A core challenge is designing a lightweight cost model that accurately predicts the runtime of a sparse tensor program by considering the sparsity pattern, the format, and the schedule together. Our key ideas are to use a sparse convolutional network to learn meaningful features of the sparsity pattern and to capture the coupled behavior of the format and the schedule with a specially designed schedule template. In addition, within the enormous co-optimization search space, our novel search strategy, an approximate nearest neighbor search, efficiently and accurately retrieves the best format and schedule for a given sparsity pattern. We evaluated WACO on four algorithms (SpMV, SpMM, SDDMM, and MTTKRP) on a CPU using 726 different sparsity patterns. Our experimental results showed that WACO outperformed four state-of-the-art baselines: Intel MKL, BestFormat, TACO with a default schedule, and ASpT. Compared to the best of the four baselines, WACO achieved 1.43×, 1.18×, 1.14×, and 1.27× average speedups on SpMV, SpMM, SDDMM, and MTTKRP, respectively.
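To make the retrieval step concrete, the sketch below illustrates the general shape of a nearest-neighbor lookup over embeddings of (format, schedule) candidates. It is a minimal illustration under stated assumptions, not WACO's actual implementation: `extract_pattern_features` is a hypothetical stand-in for the paper's sparse convolutional feature extractor, the candidate table is invented, and a brute-force scan replaces the approximate nearest neighbor index that WACO would use at scale.

```python
# Minimal sketch of retrieving a (format, schedule) pair for a sparsity
# pattern via nearest-neighbor search in a shared embedding space.
# All names here (extract_pattern_features, CANDIDATES) are hypothetical;
# WACO itself learns these embeddings with a sparse convolutional network
# and queries an approximate nearest neighbor index, not this exact scan.
import numpy as np

rng = np.random.default_rng(0)

def extract_pattern_features(dense_pattern: np.ndarray) -> np.ndarray:
    """Stand-in for the learned feature extractor: pool simple density
    statistics of the pattern into a fixed-length vector."""
    nnz_per_row = (dense_pattern != 0).sum(axis=1)
    return np.array([
        dense_pattern.shape[0],   # number of rows
        dense_pattern.shape[1],   # number of columns
        nnz_per_row.mean(),       # mean nonzeros per row
        nnz_per_row.std(),        # spread of row densities
    ], dtype=np.float32)

# Invented placeholder embeddings of format/schedule candidates; in the
# real system these would come from the trained cost model.
CANDIDATES = [
    ("CSR  + row-split schedule", rng.normal(size=4).astype(np.float32)),
    ("CSC  + col-split schedule", rng.normal(size=4).astype(np.float32)),
    ("BCSR + tiled schedule",     rng.normal(size=4).astype(np.float32)),
]

def retrieve_best(pattern: np.ndarray) -> str:
    """Return the candidate whose embedding is nearest to the pattern's
    feature vector (an ANN index would replace this loop at scale)."""
    query = extract_pattern_features(pattern)
    dists = [np.linalg.norm(query - emb) for _, emb in CANDIDATES]
    return CANDIDATES[int(np.argmin(dists))][0]

# Example: a random 64x64 pattern with ~5% density.
A = (rng.random((64, 64)) < 0.05).astype(np.float32)
print(retrieve_best(A))
```

The design point the sketch captures is that, once patterns and candidates live in one embedding space, choosing a format and schedule reduces to a similarity query rather than an exhaustive autotuning run over the co-optimization space.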
