Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories
Norman A. Rink | Jeronimo Castrillon | Fazal Hameed | Asif Ali Khan
[1] Yuefan Deng,et al. New trends in high performance computing , 2001, Parallel Computing.
[2] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[3] Jeronimo Castrillon,et al. RTSim: A Cycle-Accurate Simulator for Racetrack Memories , 2019, IEEE Computer Architecture Letters.
[4] Devin Matthews,et al. High-Performance Tensor Contraction without BLAS , 2016, ArXiv.
[5] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..
[6] Kuei-Hung Shen,et al. Racetrack Memory: A high-performance, low-cost, non-volatile memory based on magnetic domain walls , 2011, 2011 International Electron Devices Meeting.
[7] H.-S. Philip Wong,et al. Phase Change Memory , 2010, Proceedings of the IEEE.
[8] Edwin Hsing-Mean Sha,et al. Efficient Data Placement for Improving Data Access Performance on Domain-Wall Memory , 2016, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[9] Ehsan Atoofian,et al. Reducing shift penalty in Domain Wall Memory through register locality , 2015, 2015 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES).
[10] Sriram Krishnamoorthy,et al. A Code Generator for High-Performance Tensor Contractions on GPUs , 2019, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[11] Jeronimo Castrillon,et al. Generalized Data Placement Strategies for Racetrack Memories , 2020, 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[12] Stuart Parkin,et al. Memory on the racetrack. , 2015, Nature nanotechnology.
[13] Virginia Vassilevska Williams,et al. Multiplying matrices faster than Coppersmith-Winograd , 2012, STOC '12.
[14] Razvan Pascanu,et al. Theano: A CPU and GPU Math Compiler in Python , 2010, SciPy.
[15] Chirag Garg,et al. Magnetic Racetrack Memory: From Physics to the Cusp of Applications Within a Decade , 2020, Proceedings of the IEEE.
[16] Yiran Chen,et al. An Energy-Efficient GPGPU Register File Architecture Using Racetrack Memory , 2017, IEEE Transactions on Computers.
[17] Kaushik Roy,et al. TapeCache: a high density, energy efficient cache based on domain wall memory , 2012, ISLPED '12.
[18] Meng Zhang,et al. Shift-Optimized Energy-Efficient Racetrack-Based Main Memory , 2018, J. Circuits Syst. Comput..
[19] Steven S. Muchnick,et al. Advanced Compiler Design and Implementation , 1997 .
[20] Don Coppersmith,et al. Matrix multiplication via arithmetic progressions , 1987, STOC.
[21] Jeronimo Castrillon,et al. SHRIMP: Efficient Instruction Delivery with Domain Wall Memory , 2019, 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED).
[22] David E. Bernholdt,et al. Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models , 2005, Proceedings of the IEEE.
[23] Wei-Che Tseng,et al. Data Allocation Optimization for Hybrid Scratch Pad Memory With SRAM and Nonvolatile Memory , 2013, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[24] Luca Antiga,et al. Automatic differentiation in PyTorch , 2017 .
[25] Alfred V. Aho,et al. Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.
[26] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[27] Mithuna Thottethodi,et al. Recursive Array Layouts and Fast Matrix Multiplication , 2002, IEEE Trans. Parallel Distributed Syst..
[28] Mahmut T. Kandemir,et al. Banked scratch-pad memory management for reducing leakage energy consumption , 2004, IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004..
[29] Jeffrey S. Vetter,et al. DESTINY: A Comprehensive Tool with 3D and Multi-Level Cell Memory Modeling Capability , 2017 .
[30] Jörg Stiller,et al. CFDlang: High-level code generation for high-order methods in fluid dynamics , 2018, RWDSL2018.
[31] Shoaib Kamil,et al. The tensor algebra compiler , 2017, Proc. ACM Program. Lang..
[32] Michael Kruse,et al. High-Performance Generalized Tensor Operations , 2018, ACM Trans. Archit. Code Optim..
[33] Mahmut T. Kandemir,et al. Dynamic management of scratch-pad memory space , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).
[34] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.
[35] Markus Püschel,et al. A basic linear algebra compiler for embedded processors , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[36] Dong Li,et al. A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-Volatile On-Chip Caches , 2015, IEEE Transactions on Parallel and Distributed Systems.
[37] Peng Zhang,et al. Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations , 2015, ISC.
[38] Norman A. Rink,et al. Meta-programming for cross-domain tensor optimizations , 2018, GPCE.
[39] Daniele G. Spampinato,et al. A basic linear algebra compiler for structured matrices , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[40] Kees G. W. Goossens,et al. Memory-map selection for firm real-time SDRAM controllers , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[41] Jeronimo Castrillon,et al. Performance and Energy-Efficient Design of STT-RAM Last-Level Cache , 2018, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[42] Rami G. Melhem,et al. FusedCache: A Naturally Inclusive, Racetrack Memory, Dual-Level Private Cache , 2016, IEEE Transactions on Multi-Scale Computing Systems.
[43] Mahmut T. Kandemir,et al. Compiler-guided leakage optimization for banked scratch-pad memories , 2005, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[44] Wenqing Wu,et al. Cross-layer racetrack memory design for ultra high density and low power consumption , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).
[45] Robert A. van de Geijn,et al. A Family of High-Performance Matrix Multiplication Algorithms , 2001, International Conference on Computational Science.
[46] Viktor K. Prasanna,et al. Efficient Matrix Multiplication Using Cache Conscious Data Layouts , 2000 .
[47] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.
[48] S. Parkin,et al. Magnetic Domain-Wall Racetrack Memory , 2008, Science.
[49] Zhu Wang,et al. Endurance-Aware Allocation of Data Variables on NVM-Based Scratchpad Memory in Real-Time Embedded Systems , 2015, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[50] Hai Li,et al. Quantitative modeling of racetrack memory, a tradeoff among area, performance, and power , 2015, The 20th Asia and South Pacific Design Automation Conference.
[51] Kaushik Roy,et al. STAG: Spintronic-Tape Architecture for GPGPU cache hierarchies , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[52] Paolo Bientinesi,et al. Design of a High-Performance GEMM-like Tensor–Tensor Multiplication , 2016, ACM Trans. Math. Softw..
[53] Jerónimo Castrillón,et al. ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0 , 2019, ACM Trans. Archit. Code Optim..
[54] Albert Cohen,et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions , 2018, ArXiv.
[55] Franz Franchetti,et al. SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.
[56] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.
[57] Haichen Shen,et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning , 2018, OSDI.
[58] Keshav Pingali,et al. High-level semantic optimization of numerical codes , 1999, ICS '99.
[59] Gerhard Fettweis,et al. A Hardware/Software Stack for Heterogeneous Systems , 2018, IEEE Transactions on Multi-Scale Computing Systems.
[60] Charles L. Lawson,et al. Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.
[61] Takahiro Katagiri,et al. Parallel Processing of Matrix Multiplication in a CPU and GPU Heterogeneous Environment , 2006, VECPAR.
[62] Jack J. Dongarra,et al. Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor , 2009, Parallel Comput..
[63] Norman A. Rink,et al. Optimizing tensor contractions for embedded devices with racetrack memory scratch-pads , 2019, LCTES.