ALTO: Adaptive Linearized Storage of Sparse Tensors

The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process because of their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space close to each other in memory. To generate the indexing metadata, ALTO uses an adaptive bit-encoding scheme that trades off index computations for lower memory usage and more effective use of memory bandwidth. Moreover, by decoupling its sparse representation from the irregular spatial distribution of nonzero elements, ALTO eliminates workload imbalance and greatly reduces the synchronization overhead of tensor computations. As a result, the parallel performance of ALTO-based tensor operations becomes a function of their inherent data reuse. On a gamut of tensor datasets, ALTO outperforms an oracle that selects the best state-of-the-art format for each dataset when used in key tensor decomposition operations. Specifically, ALTO achieves a geometric-mean speedup of 8x over the best mode-agnostic (coordinate and hierarchical coordinate) formats, while delivering a geometric-mean compression ratio of 4.3x relative to the best mode-specific (compressed sparse fiber) formats.
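To make the linearization idea concrete, the sketch below shows a mode-agnostic index encoding in Python. It uses a simple non-uniform bit interleaving, where each mode contributes only as many bits as its dimension requires, as a stand-in for ALTO's adaptive encoding; the function names, signatures, and exact bit layout are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of mode-agnostic linearization via non-uniform bit
# interleaving. Illustrative stand-in for ALTO's adaptive encoding; the
# actual ALTO bit layout and API differ.

def interleave_bits(coords, bits_per_mode):
    """Map an N-dimensional coordinate to one linearized index by
    interleaving the bits of each mode's index."""
    value = 0
    out_pos = 0
    for bit in range(max(bits_per_mode)):
        for mode, b in enumerate(bits_per_mode):
            if bit < b:  # adaptive: small modes contribute fewer bits
                value |= ((coords[mode] >> bit) & 1) << out_pos
                out_pos += 1
    return value

def deinterleave_bits(value, bits_per_mode):
    """Recover the per-mode coordinates from a linearized index."""
    coords = [0] * len(bits_per_mode)
    in_pos = 0
    for bit in range(max(bits_per_mode)):
        for mode, b in enumerate(bits_per_mode):
            if bit < b:
                coords[mode] |= ((value >> in_pos) & 1) << bit
                in_pos += 1
    return coords

# Example: a 3-way tensor of shape 8 x 4 x 2 needs 3 + 2 + 1 index bits,
# so each nonzero's position fits in a single 6-bit linearized index.
bits = [3, 2, 1]
idx = interleave_bits([5, 3, 1], bits)
assert deinterleave_bits(idx, bits) == [5, 3, 1]
```

Sorting the nonzero elements by such a linearized index keeps spatially neighboring nonzeros adjacent in memory for every mode at once, and the per-mode coordinates can be recovered on the fly from the compact index. This is the trade-off the abstract describes: a modest amount of extra index computation in exchange for lower memory usage and better use of memory bandwidth.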
