Optimizing tensor contractions for embedded devices with racetrack memory scratch-pads
Norman A. Rink | Jerónimo Castrillón | Fazal Hameed | Asif Ali Khan
[1] Meng Zhang,et al. Shift-Optimized Energy-Efficient Racetrack-Based Main Memory , 2018, J. Circuits Syst. Comput..
[2] Jiwu Shu,et al. Exploring data placement in racetrack memory based scratchpad memory , 2015, 2015 IEEE Non-Volatile Memory System and Applications Symposium (NVMSA).
[3] Mahmut T. Kandemir,et al. Evaluating STT-RAM as an energy-efficient main memory alternative , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[4] Wei Wang,et al. An Optimized Matrix Multiplication on ARMv7 Architecture , 2012 .
[5] Steven S. Muchnick,et al. Advanced Compiler Design and Implementation , 1997 .
[6] S. Parkin,et al. Magnetic Domain-Wall Racetrack Memory , 2008, Science.
[7] Zhu Wang,et al. Endurance-Aware Allocation of Data Variables on NVM-Based Scratchpad Memory in Real-Time Embedded Systems , 2015, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[8] Hai Li,et al. Quantitative modeling of racetrack memory, a tradeoff among area, performance, and power , 2015, The 20th Asia and South Pacific Design Automation Conference.
[9] S. Parkin,et al. Domain-wall velocities of up to 750 m s⁻¹ driven by exchange-coupling torque in synthetic antiferromagnets , 2015, Nature Nanotechnology.
[10] Ehsan Atoofian,et al. Reducing shift penalty in Domain Wall Memory through register locality , 2015, 2015 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES).
[11] Mahmut T. Kandemir,et al. Dynamic management of scratch-pad memory space , 2001, Proceedings of the 38th Design Automation Conference.
[12] Keshav Pingali,et al. High-level semantic optimization of numerical codes , 1999, ICS '99.
[13] Rajeev Barua,et al. Dynamic allocation for scratch-pad memory using compile-time decisions , 2006, TECS.
[14] Wei-Che Tseng,et al. Data Allocation Optimization for Hybrid Scratch Pad Memory With SRAM and Nonvolatile Memory , 2013, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[15] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[16] Wenqing Wu,et al. Cross-layer racetrack memory design for ultra high density and low power consumption , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).
[17] Wang,et al. In-Datacenter Performance Analysis of a Tensor Processing Unit .
[18] L. Buda-Prejbeanu,et al. Fast current-induced domain-wall motion controlled by the Rashba effect , 2011, Nature Materials.
[19] Yiran Chen,et al. An Energy-Efficient GPGPU Register File Architecture Using Racetrack Memory , 2017, IEEE Transactions on Computers.
[20] Kaushik Roy,et al. TapeCache: a high density, energy efficient cache based on domain wall memory , 2012, ISLPED '12.
[21] Edwin Hsing-Mean Sha,et al. Efficient Data Placement for Improving Data Access Performance on Domain-Wall Memory , 2016, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[22] Mahmut T. Kandemir,et al. Banked scratch-pad memory management for reducing leakage energy consumption , 2004, IEEE/ACM International Conference on Computer Aided Design (ICCAD 2004).
[23] Daniele G. Spampinato,et al. A basic linear algebra compiler for structured matrices , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[24] Nikil D. Dutt,et al. Efficient utilization of scratch-pad memory in embedded processor applications , 1997, Proceedings European Design and Test Conference. ED & TC 97.
[25] Jörg Stiller,et al. CFDlang: High-level code generation for high-order methods in fluid dynamics , 2018, RWDSL2018.
[26] Paolo Bientinesi,et al. Design of a High-Performance GEMM-like Tensor–Tensor Multiplication , 2016, ACM Trans. Math. Softw..
[27] Shoaib Kamil,et al. The tensor algebra compiler , 2017, Proc. ACM Program. Lang..
[28] Robert A. van de Geijn,et al. A Family of High-Performance Matrix Multiplication Algorithms , 2001, International Conference on Computational Science.
[29] Peter Marwedel,et al. Scratchpad memory: a design alternative for cache on-chip memory in embedded systems , 2002, Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES 2002).
[30] Jerónimo Castrillón,et al. ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0 , 2019, ACM Trans. Archit. Code Optim..
[31] Albert Cohen,et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions , 2018, ArXiv.
[32] Michael Kruse,et al. High-Performance Generalized Tensor Operations , 2018, ACM Trans. Archit. Code Optim..
[33] Stuart Parkin,et al. Memory on the racetrack , 2015, Nature Nanotechnology.
[34] Franz Franchetti,et al. SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.
[35] Don Coppersmith,et al. Matrix multiplication via arithmetic progressions , 1987, STOC.
[36] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.
[37] H.-S. Philip Wong,et al. Phase Change Memory , 2010, Proceedings of the IEEE.
[38] Norman A. Rink,et al. Meta-programming for cross-domain tensor optimizations , 2018, GPCE.
[39] Sriram Krishnamoorthy,et al. A Code Generator for High-Performance Tensor Contractions on GPUs , 2019, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[40] Mary Jane Irwin,et al. Banked scratch-pad memory management for reducing leakage energy consumption , 2004, ICCAD 2004.
[41] Virginia Vassilevska Williams,et al. Multiplying matrices faster than Coppersmith-Winograd , 2012, STOC '12.
[42] Charles L. Lawson,et al. Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.
[43] Takahiro Katagiri,et al. Parallel Processing of Matrix Multiplication in a CPU and GPU Heterogeneous Environment , 2006, VECPAR.
[44] Jack J. Dongarra,et al. Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor , 2009, Parallel Comput..
[45] David E. Bernholdt,et al. Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models , 2005, Proceedings of the IEEE.
[46] Luca Antiga,et al. Automatic differentiation in PyTorch , 2017 .
[47] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.
[48] Dong Li,et al. A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-Volatile On-Chip Caches , 2015, IEEE Transactions on Parallel and Distributed Systems.
[49] Jeffrey S. Vetter,et al. DESTINY: A Comprehensive Tool with 3D and Multi-Level Cell Memory Modeling Capability , 2017 .
[50] Kuei-Hung Shen,et al. Racetrack Memory: A high-performance, low-cost, non-volatile memory based on magnetic domain walls , 2011, 2011 International Electron Devices Meeting.
[51] Taejoon Park,et al. Energy-Efficient Approximate Multiplication for Digital Signal Processing and Classification Applications , 2015, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[52] Markus Püschel,et al. A basic linear algebra compiler for embedded processors , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[53] Peng Zhang,et al. Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations , 2015, ISC.
[54] Jeronimo Castrillon,et al. Performance and Energy-Efficient Design of STT-RAM Last-Level Cache , 2018, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[55] Mahmut T. Kandemir,et al. Compiler-guided leakage optimization for banked scratch-pad memories , 2005, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[56] Siddharth Joshi,et al. FPGA Based High Performance Double-Precision Matrix Multiplication , 2009, 2009 22nd International Conference on VLSI Design.
[57] Jeronimo Castrillon,et al. RTSim: A Cycle-Accurate Simulator for Racetrack Memories , 2019, IEEE Computer Architecture Letters.
[58] Devin Matthews,et al. High-Performance Tensor Contraction without BLAS , 2016, ArXiv.
[59] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..
[60] Haichen Shen,et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning , 2018, OSDI.
[61] Yu Wang,et al. Performance-centric register file design for GPUs using racetrack memory , 2016, 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC).