Beyond the Memory Wall: A Case for Memory-Centric HPC System for Deep Learning
[1] Ting Yao,et al. Deep Learning for Video Classification and Captioning , 2016, Frontiers of Multimedia Research.
[2] Sanja Fidler,et al. MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Xuehai Zhou,et al. PuDianNao: A Polyvalent Machine Learning Accelerator , 2015, ASPLOS.
[4] Amar Phanishayee,et al. Benchmarking and Analyzing Deep Neural Network Training , 2018, 2018 IEEE International Symposium on Workload Characterization (IISWC).
[5] Gunhee Kim,et al. SplitNet: Learning to Semantically Split Deep Networks for Parameter Reduction and Model Parallelization , 2017, ICML.
[6] Song Han,et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[7] Wei Xu,et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Patrick Judd,et al. Stripes: Bit-serial deep neural network computing , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[9] Natalie D. Enright Jerger,et al. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[10] Mehrzad Samadi,et al. Memory-centric system interconnect design with hybrid memory cubes , 2013, PACT 2013.
[11] Dong Han,et al. Cambricon: An Instruction Set Architecture for Neural Networks , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[12] Jaejin Lee,et al. Transparent GPU memory management for DNNs , 2018, PPOPP.
[13] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[14] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.
[15] Eriko Nurvitadhi,et al. Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? , 2017, FPGA.
[16] Yoshua Bengio,et al. Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes , 2016, ArXiv.
[17] Natalie D. Enright Jerger,et al. Modular Routing Design for Chiplet-Based Systems , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[18] James Demmel,et al. ImageNet Training in Minutes , 2017, ICPP.
[19] Robert A. van de Geijn,et al. Collective communication on architectures that support simultaneous communication over multiple links , 2006, PPoPP '06.
[20] Takuya Akiba,et al. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes , 2017, ArXiv.
[21] Lei Zhang,et al. Neuromorphic accelerators: A comparison between neuroscience and machine-learning approaches , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[22] Hadi Esmaeilzadeh,et al. Scale-Out Acceleration for Machine Learning , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[23] David W. Nellans,et al. Towards high performance paged memory for GPUs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[24] Kilian Q. Weinberger,et al. Deep Networks with Stochastic Depth , 2016, ECCV.
[25] John Kim,et al. Multi-GPU System Design with Memory Networks , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[26] Onur Mutlu,et al. Ramulator: A Fast and Extensible DRAM Simulator , 2016, IEEE Computer Architecture Letters.
[27] Jaeha Kim,et al. Memory-centric system interconnect design with Hybrid Memory Cubes , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[28] Kilian Q. Weinberger,et al. Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Juan Carlos Niebles,et al. Leveraging Video Descriptions to Learn Video Question Answering , 2016, AAAI.
[30] Natalie D. Enright Jerger,et al. Enabling interposer-based disintegration of multi-core processors , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[31] Eriko Nurvitadhi,et al. A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study , 2018, FPGA.
[32] Zheng Zhang,et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.
[33] Bruce Jacob,et al. DRAMSim2: A Cycle Accurate Memory System Simulator , 2011, IEEE Computer Architecture Letters.
[34] Seth H. Pugsley,et al. USIMM: the Utah SImulated Memory Module , 2012.
[35] Andreas Moshovos,et al. Bit-Pragmatic Deep Neural Network Computing , 2016, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[36] Christoforos E. Kozyrakis,et al. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.
[37] Gunhee Kim,et al. A Read-Write Memory Network for Movie Story Understanding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[38] Gu-Yeon Wei,et al. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[39] Alberto Delmas,et al. Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How , 2018, ArXiv.
[40] Eriko Nurvitadhi,et al. High performance binary neural networks on the Xeon+FPGA™ platform , 2017, 2017 27th International Conference on Field Programmable Logic and Applications (FPL).
[41] Chen Meng,et al. Training Deeper Models by GPU Memory Optimization on TensorFlow , 2017.
[42] David A. Wood,et al. Supporting x86-64 address translation for 100s of GPU lanes , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[43] Trevor Darrell,et al. Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[44] Carole-Jean Wu,et al. MCM-GPU: Multi-chip-module GPUs for continued performance scalability , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[45] Hadi Esmaeilzadeh,et al. TABLA: A unified template-based framework for accelerating statistical machine learning , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[46] Alex Graves,et al. Neural Turing Machines , 2014, ArXiv.
[47] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.
[48] William J. Dally,et al. SCNN: An accelerator for compressed-sparse convolutional neural networks , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[49] Eriko Nurvitadhi,et al. Accelerating Deep Convolutional Networks using low-precision and sparsity , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[50] Amar Phanishayee,et al. TBD: Benchmarking and Analyzing Deep Neural Network Training , 2018, ArXiv.
[51] Alex Krizhevsky,et al. One weird trick for parallelizing convolutional neural networks , 2014, ArXiv.
[52] Gennady Pekhimenko,et al. DNN-Train: Benchmarking and Analyzing DNN Training , 2018.
[53] Tianshi Chen,et al. ShiDianNao: Shifting vision processing closer to the sensor , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[54] Natalia Gimelshein,et al. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[55] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[56] Asit K. Mishra,et al. From high-level deep neural models to FPGAs , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[57] NVIDIA DGX-1 System Architecture White Paper , 2017.
[58] Shaoli Liu,et al. Cambricon-X: An accelerator for sparse neural networks , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[59] Vivienne Sze,et al. 14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks , 2016, ISSCC.
[60] Thomas F. Wenisch,et al. System-level implications of disaggregated memory , 2012, IEEE International Symposium on High-Performance Computer Architecture.
[61] Thomas F. Wenisch,et al. Disaggregated memory for expansion and sharing in blade servers , 2009, ISCA '09.
[62] Stephen W. Keckler,et al. Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks , 2017, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[63] J. Jeddeloh,et al. Hybrid memory cube new DRAM architecture increases density and performance , 2012, 2012 Symposium on VLSI Technology (VLSIT).
[64] Jia Wang,et al. DaDianNao: A Machine-Learning Supercomputer , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[65] Tao Zhang,et al. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[66] Abhishek Bhattacharjee,et al. Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2014, ASPLOS.
[67] Gu-Yeon Wei,et al. 14.3 A 28nm SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications , 2017, 2017 IEEE International Solid-State Circuits Conference (ISSCC).
[68] Tianqi Chen,et al. Training Deep Nets with Sublinear Memory Cost , 2016, ArXiv.
[69] Pradeep Dubey,et al. On Scale-out Deep Learning Training for Cloud and HPC , 2018, ArXiv.
[70] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.
[71] Sudhakar Yalamanchili,et al. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).