NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators
Kaisheng Ma | Zhanhong Tan | Runpei Dong | Hongyu Cai
[1] Pengfei Xu,et al. AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs , 2020, FPGA.
[2] Xiaowei Li,et al. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[3] Andrew Stewart,et al. 10.1 A pin-efficient 20.83Gb/s/wire 0.94pJ/bit forwarded clock CNRZ-5-coded SerDes up to 12mm for MCM packages in 28nm CMOS , 2016, 2016 IEEE International Solid-State Circuits Conference (ISSCC).
[4] Tianshi Chen,et al. ShiDianNao: Shifting vision processing closer to the sensor , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[5] Meng-Fan Chang,et al. 15.4 A 22nm 2Mb ReRAM Compute-in-Memory Macro with 121-28TOPS/W for Multibit MAC Computing for Tiny AI Edge Devices , 2020, 2020 IEEE International Solid-State Circuits Conference (ISSCC).
[6] Ben H. H. Juurlink,et al. (When) Will CMPs Hit the Power Wall? , 2009, Euro-Par Workshops.
[7] Jason Cong,et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.
[8] David M. Brooks,et al. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[9] Mariagrazia Graziano,et al. New Logic-In-Memory Paradigms: An Architectural and Technological Perspective , 2019, Micromachines.
[10] Rong Jin,et al. 7.2 A 12nm Programmable Convolution-Efficient Neural-Processing-Unit Chip Achieving 825TOPS , 2020, 2020 IEEE International Solid-State Circuits Conference (ISSCC).
[11] Mark Sandler,et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[12] Geoffrey Zweig,et al. From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] William J. Dally,et al. Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture , 2019, MICRO.
[14] Norman P. Jouppi,et al. Google's Training Chips Revealed: TPUv2 and TPUv3 , 2020, 2020 IEEE Hot Chips 32 Symposium (HCS).
[15] Xi Chen,et al. A 1.17pJ/b 25Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication in 16nm CMOS using a process- and temperature-adaptive voltage regulator , 2018, 2018 IEEE International Solid-State Circuits Conference (ISSCC).
[16] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[17] William J. Dally,et al. A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm , 2019, 2019 Symposium on VLSI Circuits.
[18] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.
[19] William J. Dally,et al. A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm , 2020, IEEE Journal of Solid-State Circuits.
[20] Vivek Sarkar,et al. Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach , 2019, MICRO.
[21] Hoi-Jun Yoo,et al. B-Face: 0.2 mW CNN-Based Face Recognition Processor with Face Alignment for Mobile User Identification , 2018, 2018 IEEE Symposium on VLSI Circuits.
[22] Karthikeyan Sankaralingam,et al. Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.
[23] Sean White,et al. 'Zeppelin': An SoC for multichip architectures , 2018, 2018 IEEE International Solid-State Circuits Conference (ISSCC).
[24] Natalie D. Enright Jerger,et al. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[25] Rob A. Rutenbar,et al. A Scalable Bayesian Inference Accelerator for Unsupervised Learning , 2020, 2020 IEEE Hot Chips 32 Symposium (HCS).
[26] Lei He,et al. Light-OPU: An FPGA-based Overlay Processor for Lightweight Convolutional Neural Networks , 2020, FPGA.
[27] Ron Ho,et al. 3.3 A 14nm 1GHz FPGA with 2.5D transceiver integration , 2017, 2017 IEEE International Solid-State Circuits Conference (ISSCC).
[28] Jason Cong,et al. Caffeine: Toward Uniformed Representation and Acceleration for Deep Convolutional Neural Networks , 2019, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[29] Minsoo Rhu,et al. Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations , 2020, 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).
[30] William J. Dally,et al. A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology , 2019, 2019 IEEE Hot Chips 31 Symposium (HCS).
[31] Chia-Hung Liu,et al. 7.1 A 3.4-to-13.3TOPS/W 3.6TOPS Dual-Core Deep-Learning Accelerator for Versatile AI Applications in 7nm 5G Smartphone SoC , 2020, 2020 IEEE International Solid-State Circuits Conference (ISSCC).
[32] Teja Singh,et al. 2.1 Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core , 2020, 2020 IEEE International Solid-State Circuits Conference (ISSCC).
[33] Marian Verhelst,et al. 14.5 Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural Network processor in 28nm FDSOI , 2017, 2017 IEEE International Solid-State Circuits Conference (ISSCC).
[34] Jing Xia,et al. DaVinci: A Scalable Architecture for Neural Network Computing , 2019, 2019 IEEE Hot Chips 31 Symposium (HCS).
[35] Natalie D. Enright Jerger,et al. Modular Routing Design for Chiplet-Based Systems , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[36] Hao-Jie Zhan,et al. A 16nm 256-bit wide 89.6GByte/s total bandwidth in-package interconnect with 0.3V swing and 0.062pJ/bit power in InFO package , 2016, 2016 IEEE Hot Chips 28 Symposium (HCS).
[37] Christoforos E. Kozyrakis,et al. TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators , 2019, ASPLOS.
[38] Ali Farhadi,et al. You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Meng-Fan Chang,et al. 14.3 A 65nm Computing-in-Memory-Based CNN Processor with 2.9-to-35.8TOPS/W System Energy Efficiency Using Dynamic-Sparsity Performance-Scaling Architecture and Energy-Efficient Inter/Intra-Macro Data Reuse , 2020, 2020 IEEE International Solid-State Circuits Conference (ISSCC).
[40] Ralph Wittig,et al. Xilinx Versal™ Premium , 2020, 2020 IEEE Hot Chips 32 Symposium (HCS).
[41] Vivienne Sze,et al. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices , 2018, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
[42] Yoon Kim,et al. Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.
[43] Berin Martini,et al. NeuFlow: A runtime reconfigurable dataflow processor for vision , 2011, CVPR 2011 WORKSHOPS.
[44] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.
[45] William J. Dally,et al. MAGNet: A Modular Accelerator Generator for Neural Networks , 2019, 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[46] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.
[47] Joel Emer,et al. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks , 2016, CARN.
[48] Meng-Fan Chang,et al. Sticker: A 0.41-62.1 TOPS/W 8Bit Neural Network Processor with Multi-Sparsity Compatible Convolution Arrays and Online Tuning Acceleration for Fully Connected Layers , 2018, 2018 IEEE Symposium on VLSI Circuits.
[49] Hoi-Jun Yoo,et al. UNPU: A 50.6TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision , 2018, 2018 IEEE International Solid-State Circuits Conference (ISSCC).
[50] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[51] Christoforos E. Kozyrakis,et al. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.
[52] Liang Han,et al. Hanguang 800 NPU – The Ultimate AI Inference Solution for Data Centers , 2020, 2020 IEEE Hot Chips 32 Symposium (HCS).
[53] Yu Cao,et al. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks , 2017, FPGA.
[54] Vivienne Sze,et al. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks , 2017, IEEE Journal of Solid-State Circuits.
[55] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[56] Brucek Khailany,et al. Timeloop: A Systematic Approach to DNN Accelerator Evaluation , 2019, 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[57] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[58] Jason Cong,et al. Scaling for edge inference of deep neural networks , 2018.
[59] Debjit Das Sarma,et al. Compute and Redundancy Solution for the Full Self-Driving Computer , 2019, 2019 IEEE Hot Chips 31 Symposium (HCS).
[60] Jae-Gon Lee,et al. 7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC , 2019, 2019 IEEE International Solid-State Circuits Conference (ISSCC).
[61] Carole-Jean Wu,et al. MCM-GPU: Multi-chip-module GPUs for continued performance scalability , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[62] William J. Dally,et al. SCNN: An accelerator for compressed-sparse convolutional neural networks , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[63] Patrick Groeneveld,et al. ISPD 2020 Physical Mapping of Neural Networks on a Wafer-Scale Deep Learning Accelerator , 2020, ISPD.
[64] Yun Liang,et al. REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs , 2019, FPGA.
[65] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[66] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[67] Jack Hu,et al. A 7nm 4GHz Arm®-core-based CoWoS® Chiplet Design for High Performance Computing , 2019, 2019 Symposium on VLSI Circuits.
[68] Yong Wang,et al. Baidu Kunlun An AI processor for diversified workloads , 2020, 2020 IEEE Hot Chips 32 Symposium (HCS).