Niraj K. Jha | Ye Yu
[1] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.
[2] Martin D. Schatz,et al. Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications , 2018, ArXiv.
[3] F. Clermidy,et al. 3D sequential integration opportunities and technology optimization , 2014, IEEE International Interconnect Technology Conference.
[4] Dongrui Fan,et al. Accelerating CNN Algorithm with Fine-Grained Dataflow Architectures , 2018, 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS).
[5] Li Fei-Fei,et al. Progressive Neural Architecture Search , 2017, ECCV.
[6] Bo-Cheng Lai,et al. Supporting compressed-sparse activations and weights on SIMD-like accelerator for sparse convolutional neural networks , 2018, 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC).
[7] Dajiang Zhou,et al. Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.
[8] Tianshi Chen,et al. ShiDianNao: Shifting vision processing closer to the sensor , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[9] Vijay Vasudevan,et al. Learning Transferable Architectures for Scalable Image Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[10] Tao Wang,et al. Image Classification at Supercomputer Scale , 2018, ArXiv.
[11] Natalie D. Enright Jerger,et al. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[12] Yuanzhou Yang,et al. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes , 2018, ArXiv.
[13] Sudhakar Yalamanchili,et al. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[14] Andreas Moshovos,et al. Bit-Pragmatic Deep Neural Network Computing , 2016, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[15] Hyoukjun Kwon,et al. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects , 2018, ASPLOS.
[16] Jinjun Xiong,et al. Application-Transparent Near-Memory Processing Architecture with Memory Channel Network , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[17] Sepp Hochreiter,et al. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions , 1998, Int. J. Uncertain. Fuzziness Knowl. Based Syst.
[18] Pritish Narayanan,et al. Deep Learning with Limited Numerical Precision , 2015, ICML.
[19] Wonyong Sung,et al. FPGA based implementation of deep neural networks using on-chip memory only , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[20] Tajana Rosing,et al. NNPIM: A Processing In-Memory Architecture for Neural Network Acceleration , 2019, IEEE Transactions on Computers.
[21] Jarrod A. Roy,et al. Capo: robust and scalable open-source min-cut floorplacer , 2005, ISPD '05.
[22] Richard Vuduc,et al. Automatic performance tuning of sparse matrix kernels , 2003.
[23] Diederik Verkest,et al. Physical Design Solutions to Tackle FEOL/BEOL Degradation in Gate-level Monolithic 3D ICs , 2016, ISLPED.
[24] Yu Wang,et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network , 2016, FPGA.
[25] Michael Ferdman,et al. Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer , 2017, 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).
[26] Tor M. Aamodt,et al. Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[27] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.
[28] Xiaowei Li,et al. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[29] Mengjia Yan,et al. UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[30] Jung Ho Ahn,et al. Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[31] Michael Ferdman,et al. Overcoming resource underutilization in spatial CNN accelerators , 2016, 2016 26th International Conference on Field Programmable Logic and Applications (FPL).
[32] Shaoli Liu,et al. Cambricon-X: An accelerator for sparse neural networks , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[33] Jason Cong,et al. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks , 2016, 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[34] Pradeep Dubey,et al. SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[35] Niraj K. Jha,et al. Hybrid Monolithic 3-D IC Floorplanner , 2018, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[36] Thomas F. La Porta,et al. Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices , 2017, ACM Multimedia.
[37] Denis Foley,et al. Ultra-Performance Pascal GPU and NVLink Interconnect , 2017, IEEE Micro.
[38] Sung Kyu Lim,et al. Ultra-high density 3D SRAM cell designs for monolithic 3D integration , 2012, 2012 IEEE International Interconnect Technology Conference.
[39] Hao Wu,et al. Mixed Precision Training , 2017, ICLR.
[40] Scott A. Mahlke,et al. Scalpel: Customizing DNN pruning to the underlying hardware parallelism , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[41] O. Rozeau,et al. Compact 6T SRAM cell with robust read/write stabilizing design in 45nm Monolithic 3D IC technology , 2009, 2009 IEEE International Conference on IC Design and Technology.
[42] Song Han,et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[43] Stephen W. Keckler,et al. Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks , 2017, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[44] Jian Sun,et al. Identity Mappings in Deep Residual Networks , 2016, ECCV.
[45] Jia Wang,et al. DaDianNao: A Machine-Learning Supercomputer , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[46] Niraj K. Jha,et al. Energy-Efficient Monolithic Three-Dimensional On-Chip Memory Architectures , 2018, IEEE Transactions on Nanotechnology.
[47] Olivier Giroux,et al. Volta: Performance and Programmability , 2018, IEEE Micro.
[48] Song Han,et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding , 2015, ICLR.
[49] Sung Kyu Lim,et al. Power-performance study of block-level monolithic 3D-ICs considering inter-tier performance variations , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).
[50] Pradeep Dubey,et al. Mixed Precision Training of Convolutional Neural Networks using Integer Operations , 2018, ICLR.
[51] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[52] Miao Hu,et al. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[53] Nikko Strom,et al. Scalable distributed DNN training using commodity GPU cloud computing , 2015, INTERSPEECH.
[54] Sergey Ioffe,et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.
[55] Olivier Billoint,et al. Intermediate BEOL process influence on power and performance for 3DVLSI , 2015, 2015 International 3D Systems Integration Conference (3DIC).
[56] Edward Y. Chang,et al. Distributed Training Large-Scale Deep Architectures , 2017, ADMA.
[57] William J. Dally,et al. SCNN: An accelerator for compressed-sparse convolutional neural networks , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[58] Miriam Bellver,et al. Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster , 2017, ICCS.
[59] Yoshua Bengio,et al. Training deep neural networks with low precision multiplications , 2014.
[60] Cong Xu,et al. NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory , 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[61] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[62] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[63] Sung Kyu Lim,et al. How to Cope with Slow Transistors in the Top-tier of Monolithic 3D ICs: Design Studies and CAD Solutions , 2016, ISLPED.
[64] Norman P. Jouppi,et al. CACTI 6.0: A Tool to Model Large Caches , 2009.
[65] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[66] Christoforos E. Kozyrakis,et al. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.
[67] Krste Asanovic,et al. Convergence and scalarization for data-parallel architectures , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[68] Tianshi Chen,et al. Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[69] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.
[70] Tao Zhang,et al. NVMain 2.0: A User-Friendly Memory Simulator to Model (Non-)Volatile Memory Systems , 2015, IEEE Computer Architecture Letters.
[71] Niraj K. Jha,et al. A Monolithic 3D Hybrid Architecture for Energy-Efficient Computation , 2018, IEEE Transactions on Multi-Scale Computing Systems.
[72] D. Williamson. Dynamically scaled fixed point arithmetic , 1991, [1991] IEEE Pacific Rim Conference on Communications, Computers and Signal Processing Conference Proceedings.
[73] Song Han,et al. Learning both Weights and Connections for Efficient Neural Network , 2015, NIPS.
[74] Chi-Ying Tsui,et al. SparseNN: An energy-efficient neural network accelerator exploiting input and output sparsity , 2017, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[75] Manoj Alwani,et al. Fused-layer CNN accelerators , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[76] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[77] Niraj K. Jha,et al. Software-Defined Design Space Exploration for an Efficient DNN Accelerator Architecture , 2019, IEEE Transactions on Computers.
[78] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[79] Heng-Yuan Lee,et al. A 5ns fast write multi-level non-volatile 1 K bits RRAM memory with advance write scheme , 2009, 2009 Symposium on VLSI Circuits.
[80] Chao Wang,et al. CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[81] Yu Wang,et al. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[82] Alireza Shafaei,et al. FinCACTI: Architectural Analysis and Modeling of Caches with Deeply-Scaled FinFET Devices , 2014, 2014 IEEE Computer Society Annual Symposium on VLSI.
[83] Yuan Yu,et al. TensorFlow: A system for large-scale machine learning , 2016, OSDI.
[84] Mark Sandler,et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[85] Hadi Esmaeilzadeh,et al. Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network , 2017, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[86] Patrick Judd,et al. Stripes: Bit-serial deep neural network computing , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).