Optimizing tensor contractions for embedded devices with racetrack memory scratch-pads

Tensor contraction is a fundamental operation in many algorithms, with applications ranging from quantum chemistry and fluid dynamics to image processing and machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. On low-power embedded devices, efficient management of the memory space becomes even more crucial in order to meet energy constraints. This work investigates strategies for performance- and energy-efficient tensor contractions on embedded systems that use racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations, such as loop access-order and data-layout transformations, are paired with architectural optimizations, such as prefetching and preshifting, to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve SPM performance by 24% and reduce energy consumption by 74% compared to an iso-capacity SRAM.
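
To make the role of loop access order concrete, the sketch below shows a two-dimensional contraction (a GEMM-style kernel) in two loop orders. This is a minimal illustration, not the paper's kernel or its RTM cost model: on an RTM-based SPM, an access port must shift the nanowire domain by domain to reach a target location, so stride-1 traversals keep shifts short, while strided traversals incur long shifts. The size N and the function names are hypothetical.

```c
#include <stddef.h>

/* Illustrative sketch (not the paper's implementation): the 2-D
 * contraction C[i][j] += A[i][k] * B[k][j] in two loop orders.
 * On a racetrack-memory SPM, accessing a non-adjacent location
 * requires shifting the access port, so consecutive accesses to
 * consecutive addresses are cheap while strided accesses are not. */
enum { N = 64 };  /* hypothetical tile size */

/* i-j-k order: the innermost loop traverses B column-wise
 * (stride N), forcing a long shift between consecutive B accesses. */
void contract_ijk(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < N; ++k)
                C[i][j] += A[i][k] * B[k][j];
}

/* i-k-j order: the innermost loop walks B and C row-wise (stride 1),
 * so the RTM port moves by one domain per access in the common case. */
void contract_ikj(const float A[N][N], const float B[N][N], float C[N][N]) {
    for (size_t i = 0; i < N; ++i)
        for (size_t k = 0; k < N; ++k)
            for (size_t j = 0; j < N; ++j)
                C[i][j] += A[i][k] * B[k][j];
}
```

Both variants compute the same result; only the access pattern, and hence the number of RTM shifts, differs. Data-layout transformations play the complementary role of arranging the operands so that the chosen loop order actually produces stride-1 accesses in the SPM.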
