RaPiD: AI Accelerator for Ultra-low Precision Training and Inference
Joel Silberman | Swagath Venkataramani | Matthew M. Ziegler | Mingu Kang | Moriyoshi Ohara | Vijayalakshmi Srinivasan | Mauricio J. Serrano | Monodeep Kar | Ashish Ranjan | Sanchari Sen | Shubham Jain | Sunil Shukla | Kailash Gopalakrishnan | Jungwook Choi | Jinwook Oh | Jinwook Jung | Chia-Yu Chen | Kazuaki Ishizaki | Leland Chang | Allison Allain | Nianzheng Cao | Wei Wang | Brian W. Curran | Vidhi Zalani | Bruce M. Fleischer | Alberto Mannari | Marcel Schaal | Ching Zhou | Kyu-Hyoun Kim | Ankur Agrawal | Zhibin Ren | Kerstin Schelm | Michael Guillorn | Howard Haynie | Eri Ogawa | James Bonanno | Robert Casatuta | Scot Rider | Naigang Wang | Xiao Sun | Jintao Zhang | Hoang Tran | Yulong Li | Hiroshi Inoue | Matthew Cohen | Siyu Koswatta | Sae Kyu Lee | Martin Lutz | Silvia Mueller | Michael Scheuermann | Jie Yang | Xin Zhang | Vinay Shah | Pong-Fei Lu
[1] Kevin Duh, et al. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning, 2020, RepL4NLP@ACL.
[2] Swagath Venkataramani, et al. PACT: Parameterized Clipping Activation for Quantized Neural Networks, 2018, ArXiv.
[3] Xiangyu Zhang, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[4] Kaushik Roy, et al. Quality programmable vector processors for approximate computing, 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[5] Quoc V. Le, et al. Neural Architecture Search with Reinforcement Learning, 2016, ICLR.
[6] Joel Silberman, et al. A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference, 2018, 2018 IEEE Symposium on VLSI Circuits.
[7] Berin Martini, et al. NeuFlow: A runtime reconfigurable dataflow processor for vision, 2011, CVPR 2011 Workshops.
[8] Ninghui Sun, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, 2014, ASPLOS.
[9] Dan Alistarh, et al. Model compression via distillation and quantization, 2018, ICLR.
[10] Forrest N. Iandola, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size, 2016, ArXiv.
[11] Hiroshi Inoue, et al. DeepTools: Compiler and Execution Runtime Extensions for RaPiD AI Accelerator, 2019, IEEE Micro.
[12] Michael Behar, et al. Spring Hill (NNP-I 1000) Intel's Data Center Inference Chip, 2019, 2019 IEEE Hot Chips 31 Symposium (HCS).
[13] Dejan S. Milojicic, et al. PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference, 2019, ASPLOS.
[14] Jian Sun, et al. Deep Learning with Low Precision by Half-Wave Gaussian Quantization, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Song Han, et al. Learning both Weights and Connections for Efficient Neural Network, 2015, NIPS.
[16] Xiaowei Li, et al. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks, 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[17] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[18] Trevor Darrell, et al. Sequence to Sequence -- Video to Text, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[19] Jia Wang, et al. DaDianNao: A Machine-Learning Supercomputer, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[20] Pradeep Dubey, et al. SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[21] Erich Elsen, et al. Deep Speech: Scaling up end-to-end speech recognition, 2014, ArXiv.
[22] Jason Cong, et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, 2015, FPGA.
[23] Swagath Venkataramani, et al. Memory and Interconnect Optimizations for Peta-Scale Deep Learning Systems, 2019, 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC).
[24] Srihari Cadambi, et al. A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification, 2012, TACO.
[25] Natalia Gimelshein, et al. Virtualizing Deep Neural Networks for Memory-Efficient Neural Network Design, 2016, ArXiv.
[26] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[27] Joel Emer, et al. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[28] Yiran Chen, et al. PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning, 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[29] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Suyog Gupta, et al. To prune, or not to prune: exploring the efficacy of pruning for model compression, 2017, ICLR.
[31] Xiang Zhang, et al. Text Understanding from Scratch, 2015, ArXiv.
[32] Jun Yao, et al. A CGRA-Based Approach for Accelerating Convolutional Neural Networks, 2015, 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip.
[33] Stephen W. Keckler, et al. Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks, 2017, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[34] Natalie D. Enright Jerger, et al. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[35] Baochun Li, et al. Spotlight: Optimizing Device Placement for Training Deep Neural Networks, 2018, ICML.
[36] Vinay P. Namboodiri, et al. Multi-layer Pruning Framework for Compressing Single Shot MultiBox Detector, 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).
[37] Sudhakar Yalamanchili, et al. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[38] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Swagath Venkataramani, et al. A 3.0 TFLOPS 0.62V Scalable Processor Core for High Compute Utilization AI Training and Inference, 2020, 2020 IEEE Symposium on VLSI Circuits.
[40] Vincent Vanhoucke, et al. Improving the speed of neural networks on CPUs, 2011, NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
[41] Margo I. Seltzer, et al. Towards General-Purpose Neural Network Computing, 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).
[42] Christoforos E. Kozyrakis, et al. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory, 2017, ASPLOS.
[43] James T. Kwok, et al. Loss-aware Weight Quantization of Deep Networks, 2018, ICLR.
[44] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.
[45] Swagath Venkataramani, et al. DyHard-DNN: Even More DNN Acceleration with Dynamic Hardware Reconfiguration, 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).
[46] Charbel Sakr, et al. Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks, 2019, ICLR.
[47] Eric S. Chung, et al. A Configurable Cloud-Scale DNN Processor for Real-Time AI, 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[48] Samy Bengio, et al. Device Placement Optimization with Reinforcement Learning, 2017, ICML.
[49] Forrest N. Iandola, et al. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Yu Cao, et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks, 2016, FPGA.
[51] Fei-Fei Li, et al. Large-Scale Video Classification with Convolutional Neural Networks, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[52] Zhuo Wang, et al. In-Memory Computation of a Machine-Learning Classifier in a Standard 6T SRAM Array, 2017, IEEE Journal of Solid-State Circuits.
[53] Berin Martini, et al. A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.
[54] Xiang Zhang, et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, 2013, ICLR.
[55] Tao Zhang, et al. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[56] Patrick Judd, et al. Stripes: Bit-serial deep neural network computing, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[57] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition, 2012, IEEE Signal Processing Magazine.
[58] Jae-Joon Han, et al. Learning to Quantize Deep Networks by Optimizing Quantization Intervals With Task Loss, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Vivienne Sze, et al. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, 2017, IEEE Journal of Solid-State Circuits.
[60] David A. Patterson, et al. In-datacenter performance analysis of a tensor processing unit, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[61] Wei Liu, et al. SSD: Single Shot MultiBox Detector, 2015, ECCV.
[62] Song Han, et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[63] Daniel Brand, et al. Training Deep Neural Networks with 8-bit Floating Point Numbers, 2018, NeurIPS.
[64] Song Han, et al. Trained Ternary Quantization, 2016, ICLR.
[65] Anke Schmeink, et al. Variational Network Quantization, 2018, ICLR.
[66] Josep Torrellas, et al. SAVE: Sparsity-Aware Vector Engine for Accelerating DNN Training and Inference on CPUs, 2020, 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[67] Alaa R. Alameldeen, et al. ZCOMP: Reducing DNN Cross-Layer Memory Footprint Using Vector Extensions, 2019, MICRO.
[68] Gu-Yeon Wei, et al. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[69] Miao Hu, et al. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[70] William J. Dally, et al. Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture, 2019, MICRO.
[71] Joel Silberman, et al. A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling, 2021, 2021 IEEE International Solid-State Circuits Conference (ISSCC).
[72] Swagath Venkataramani, et al. Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks, 2019, NeurIPS.
[73] Sergey Ioffe, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 2016, AAAI.
[74] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[75] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[76] Anand Raghunathan, et al. SparCE: Sparsity Aware General-Purpose Core Extensions to Accelerate Deep Neural Networks, 2017, IEEE Transactions on Computers.
[77] Pradeep Dubey, et al. Distributed Deep Learning Using Synchronous Stochastic Gradient Descent, 2016, ArXiv.
[78] Bo Chen, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, 2017, ArXiv.
[79] Srihari Cadambi, et al. A dynamically configurable coprocessor for convolutional neural networks, 2010, ISCA.
[80] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[81] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.
[82] Hadi Esmaeilzadeh, et al. Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network, 2017, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[83] Dong Han, et al. Cambricon: An Instruction Set Architecture for Neural Networks, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).