NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units
Bongjoon Hyun | Youngeun Kwon | Yujeong Choi | John Kim | Minsoo Rhu