swTensor: accelerating tensor decomposition on Sunway architecture

Modern applications consume and generate feature-rich data that is stored in high-dimensional arrays, or tensors. Computations applied to tensors, such as Canonical Polyadic (CP) decomposition, play an important role in understanding the internal relationships within the data. Using CP decomposition to analyze large, billion-scale tensors requires tremendous computational power. Meanwhile, the emerging Sunway many-core processor has demonstrated its computational advantage by powering the first hundred-petaFLOPS supercomputer in the world. In this paper, we propose swTensor, which adapts CP decomposition to the Sunway processor by leveraging the MapReduce framework for automatic parallelization and the unique architecture of Sunway for high performance. Specifically, we divide the major computation of CP decomposition into four sub-procedures and implement each of them in the MapReduce framework with custom-designed key-value pairs. We also tile the data during the computation so that it fits into the limited local device memory on Sunway, which improves performance. Moreover, we propose a performance auto-tuning mechanism that searches for the optimal parameter settings in swTensor. The experimental results demonstrate that swTensor achieves better performance than the state-of-the-art BigTensor and CSTF, with average speedups of 1.36× and 1.24×, respectively. In addition, swTensor exhibits better scalability when scaling across multiple Sunway processors.
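To make the MapReduce formulation and the tiling idea concrete, the following is a minimal sketch in plain Python, not swTensor's actual implementation. It computes a mode-1 matricized-tensor-times-Khatri-Rao product (MTTKRP), a kernel commonly at the core of CP decomposition, for a sparse 3-way tensor. The choice of kernel, the key-value layout (row index as key, scaled product of factor rows as value), the function names, and the `tile_size` parameter are all assumptions made for illustration; the abstract does not specify the four sub-procedures or the key-value design.

```python
from collections import defaultdict
from itertools import islice


def tiles(nonzeros, tile_size):
    """Yield successive chunks of at most tile_size nonzeros.

    Stands in for tiling data to fit a small local memory (assumed scheme).
    """
    it = iter(nonzeros)
    while True:
        chunk = list(islice(it, tile_size))
        if not chunk:
            return
        yield chunk


def map_nonzero(entry, B, C):
    """Map step: emit (row index i, value * (B[j] elementwise-times C[k]))."""
    i, j, k, value = entry
    return i, [value * b * c for b, c in zip(B[j], C[k])]


def reduce_rows(pairs, rank, out):
    """Reduce step: accumulate the emitted vectors that share the same key."""
    for key, vec in pairs:
        row = out[key]
        for r in range(rank):
            row[r] += vec[r]


def mttkrp_mode1(nonzeros, B, C, rank, tile_size=2):
    """Mode-1 MTTKRP over a sparse tensor given as (i, j, k, value) tuples."""
    out = defaultdict(lambda: [0.0] * rank)
    for tile in tiles(nonzeros, tile_size):
        reduce_rows((map_nonzero(e, B, C) for e in tile), rank, out)
    return dict(out)


# Toy 2 x 2 x 2 sparse tensor with three nonzeros and rank-2 factor matrices.
X = [(0, 0, 0, 1.0), (0, 1, 1, 2.0), (1, 0, 1, 3.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[1.0, 0.5], [2.0, 1.0]]
M = mttkrp_mode1(X, B, C, rank=2)
print(M)
```

In this sketch each map call depends on a single nonzero, so map calls are independent of each other, and the tile size only bounds how many nonzeros are held at once without changing the result. In a setting like the one the abstract describes, a parameter of this kind would be a natural candidate for auto-tuning.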
