A Sparse Tensor Benchmark Suite for CPUs and GPUs

Tensor computations pose significant performance challenges across a wide spectrum of applications, ranging from machine learning, healthcare analytics, social network analysis, and data mining to quantum chemistry and signal processing. Efforts to improve the performance of tensor computations include exploring data layouts, execution scheduling, and parallelism in common tensor kernels. This work presents a benchmark suite for arbitrary-order sparse tensor kernels using two state-of-the-art tensor formats, coordinate (COO) and hierarchical coordinate (HiCOO), on CPUs and GPUs. The suite provides a set of reference kernel implementations that work with real-world tensors as well as power-law tensors generated by extending synthetic graph generation techniques. We also propose Roofline performance models for these kernels to provide insight into computer platforms from a sparse tensor perspective. The benchmark suite, along with the synthetic tensor generator, is publicly available.
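To make the setting concrete, the following is a minimal, hypothetical sketch (not code from the suite) of the COO format for a third-order sparse tensor, together with a naive mode-1 MTTKRP (matricized tensor times Khatri-Rao product), a kernel commonly benchmarked in sparse tensor suites. All names and sizes here are illustrative assumptions.

```python
import numpy as np

# COO format: one (i, j, k) index tuple per nonzero, plus a value array.
# This sketch stores three nonzeros of a 3x3x3 tensor.
inds = np.array([[0, 0, 0],
                 [1, 2, 1],
                 [2, 1, 2]])        # (i, j, k) coordinates per nonzero
vals = np.array([1.0, 2.0, 3.0])    # corresponding nonzero values

rank = 2
B = np.ones((3, rank))              # factor matrix for mode 2 (illustrative)
C = np.ones((3, rank))              # factor matrix for mode 3 (illustrative)

# Naive mode-1 MTTKRP over the COO nonzeros:
# M[i, :] += value * (B[j, :] * C[k, :]) for each stored entry.
M = np.zeros((3, rank))
for (i, j, k), v in zip(inds, vals):
    M[i, :] += v * B[j, :] * C[k, :]

print(M)
```

HiCOO, by contrast, groups nonzeros into small blocks and stores short block-local offsets instead of full indices, trading index compression for a two-level traversal; the per-nonzero update above is otherwise the same.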
