Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories

Tensor contraction is a fundamental operation in many algorithms, with applications ranging from quantum chemistry and fluid dynamics to image processing and machine learning. The performance of tensor computations depends critically on the efficient use of on-chip and off-chip memories. On low-power embedded devices, careful management of the memory space is even more important for meeting energy constraints. This work investigates strategies for performance- and energy-efficient tensor contractions on embedded systems that use racetrack-memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as loop reordering and data-layout transformations, paired with architectural optimizations such as prefetching and preshifting, reduce the shifting overhead in RTMs. For the off-chip memory, optimizations such as memory access ordering, data mapping, and the choice of a suitable access granularity reduce contention. Experimental results show that the proposed optimizations improve SPM performance and energy consumption by 32% and 73%, respectively, compared to an iso-capacity SRAM, and reduce overall DRAM dynamic energy consumption by 80%.
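To make the role of the loop access order concrete, the sketch below implements the simplest tensor contraction, C[i][j] = sum_k A[i][k] * B[k][j], with an (i, k, j) loop order. This is only an illustration of the general technique, not the paper's implementation: the function name, the row-major layout, and the mapping of strides to RTM shifts are all assumptions for the example.

    #include <stddef.h>

    /* Sketch: C[i][j] += A[i][k] * B[k][j] for i < m, k < p, j < n.
     * Row-major layout assumed; C must be zero-initialized by the caller.
     * The (i, k, j) order makes the innermost accesses to B and C
     * unit-stride, so consecutive iterations touch adjacent locations;
     * on an RTM-based SPM this keeps the per-access shift distance small,
     * whereas the naive (i, j, k) order strides through B by n elements
     * and forces long shift sequences between accesses.
     */
    void contract2(size_t m, size_t p, size_t n,
                   const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < m; ++i) {
            for (size_t k = 0; k < p; ++k) {
                const double a = A[i * p + k];        /* scalar reused across j */
                for (size_t j = 0; j < n; ++j) {
                    C[i * n + j] += a * B[k * n + j]; /* unit-stride inner walks */
                }
            }
        }
    }

The same reasoning extends to higher-order contractions once the tensors are linearized: choosing the loop order (and, dually, the data layout) so that the fastest-varying index walks contiguously through the SPM is what bounds the number of shifts an access port must perform between consecutive accesses.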
