Tucker Tensor Decomposition on FPGA

Tensor computation has emerged as a powerful mathematical tool for solving high-dimensional and/or extreme-scale problems in science and engineering. The last decade has witnessed tremendous advances in tensor computation and its applications in machine learning and big data. However, its hardware optimization on resource-constrained devices remains an (almost) unexplored field. This paper presents a hardware accelerator for a classical tensor computation framework, Tucker decomposition. We study three modules of this architecture: tensor-times-matrix (TTM), matrix singular value decomposition (SVD), and tensor permutation, and implement them on a Xilinx FPGA for prototyping. To further reduce the computing time, we propose a warm-start algorithm for the Jacobi iterations in the SVD. A fixed-point simulator is used to evaluate the performance of our design. Synthetic data sets and a real cardiac MRI data set are used to validate the design and evaluate its performance. We compare our work with state-of-the-art software toolboxes running on both CPU and GPU; our design achieves a 2.16× to 30.2× speedup on the cardiac MRI data set.
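To make the roles of the three modules concrete, the following is a minimal floating-point NumPy sketch of a truncated higher-order SVD (HOSVD), one standard way to compute a Tucker decomposition. It is an illustration under stated assumptions, not the accelerator's algorithm: the abstract does not say which Tucker algorithm is used, the hardware works in fixed point with Jacobi iterations, and the function names `unfold`, `ttm`, `hosvd`, and `reconstruct` are hypothetical. Tensor permutation appears as `np.moveaxis`, SVD as `np.linalg.svd` (a library routine, not a Jacobi solver), and TTM as the `ttm` helper.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: permute `mode` to the front, then flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def ttm(tensor, matrix, mode):
    """Tensor-times-matrix: multiply `matrix` (J x I_mode) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)              # tensor permutation
    out = matrix @ moved.reshape(moved.shape[0], -1)  # plain matrix product
    out = out.reshape((matrix.shape[0],) + moved.shape[1:])
    return np.moveaxis(out, 0, mode)                  # permute back

def hosvd(tensor, ranks):
    """Truncated HOSVD (assumed algorithm): one SVD per mode, then TTMs."""
    factors = []
    for mode, r in enumerate(ranks):
        # Leading r left singular vectors of the mode-n unfolding.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = tensor
    for mode, u in enumerate(factors):
        core = ttm(core, u.T, mode)                   # project onto each factor
    return core, factors

def reconstruct(core, factors):
    """Rebuild the full tensor from the core and the factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = ttm(out, u, mode)
    return out

# Synthetic tensor with exact multilinear rank (2, 3, 2).
rng = np.random.default_rng(0)
g = rng.standard_normal((2, 3, 2))
mats = [rng.standard_normal((n, r)) for n, r in zip((6, 7, 5), (2, 3, 2))]
x = reconstruct(g, mats)

core, factors = hosvd(x, (2, 3, 2))
rel_err = np.linalg.norm(reconstruct(core, factors) - x) / np.linalg.norm(x)
print("core shape:", core.shape, "relative error:", rel_err)
```

Because the synthetic tensor has exact low multilinear rank, the truncated HOSVD recovers it up to rounding error; on real data such as MRI volumes the truncation would give an approximation instead.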
