DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission
Parker Hill | Animesh Jain | Mason Hill | Babak Zamirai | Chang-Hong Hsu | Michael Laurenzano | Scott A. Mahlke | Lingjia Tang | Jason Mars
[1] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.
[2] Simon Haykin, et al. Gradient-Based Learning Applied to Document Recognition, 2001.
[3] Yann LeCun, et al. The MNIST database of handwritten digits, 2005.
[4] Tor M. Aamodt, et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[5] Yoshua Bengio, et al. An empirical evaluation of deep architectures on problems with many factors of variation, 2007, ICML '07.
[6] Norman P. Jouppi, et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[7] Andrew Zisserman, et al. Automated Flower Classification over a Large Number of Classes, 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.
[8] Johannes Schemmel, et al. Wafer-scale integration of analog neural networks, 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).
[9] Moon Ho Lee, et al. Performance Analysis of Bit-Width Reduced Floating-Point Arithmetic Units in FPGAs: A Case Study of Neural Network-Based Face Detector, 2009, EURASIP J. Embed. Syst.
[10] Hyesoon Kim, et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, 2009, ISCA '09.
[11] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[12] Hoi-Jun Yoo, et al. A 201.4 GOPS 496 mW Real-Time Multi-Object Recognition Processor With Bio-Inspired Neural Perception Engine, 2009, IEEE Journal of Solid-State Circuits.
[13] Jack J. Dongarra, et al. Accelerating GPU Kernels for Dense Linear Algebra, 2010, VECPAR.
[14] Srihari Cadambi, et al. A dynamically configurable coprocessor for convolutional neural networks, 2010, ISCA.
[15] Quoc V. Le, et al. On optimization methods for deep learning, 2011, ICML.
[16] Berin Martini, et al. NeuFlow: A runtime reconfigurable dataflow processor for vision, 2011, CVPR 2011 Workshops.
[17] Onur Mutlu, et al. Improving GPU performance via large warps and two-level warp scheduling, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[18] Vincent Vanhoucke, et al. Improving the speed of neural networks on CPUs, 2011.
[19] Jim D. Garside, et al. SpiNNaker: A multi-core System-on-Chip for massively-parallel neural net simulation, 2012, Proceedings of the IEEE 2012 Custom Integrated Circuits Conference.
[20] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[21] Olivier Temam, et al. A defect-tolerant accelerator for emerging high-performance applications, 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[22] Myron Flickner, et al. Compass: A scalable simulator for an architecture for cognitive computing, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[23] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.
[24] Scott A. Mahlke, et al. SAGE: Self-tuning approximation for graphics engines, 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[25] Yoshua Bengio, et al. How transferable are features in deep neural networks?, 2014, NIPS.
[26] Ninghui Sun, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, 2014, ASPLOS.
[27] Erich Elsen, et al. Deep Speech: Scaling up end-to-end speech recognition, 2014, ArXiv.
[28] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, ArXiv.
[29] Yu Wang, et al. Training itself: Mixed-signal training acceleration for memristor-based neural network, 2014, 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC).
[30] Jia Wang, et al. DaDianNao: A Machine-Learning Supercomputer, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[31] Trevor Darrell, et al. Caffe: Convolutional Architecture for Fast Feature Embedding, 2014, ACM Multimedia.
[32] Olivier Temam, et al. Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators, 2014, 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC).
[33] Trevor Darrell, et al. Recognizing Image Style, 2013, BMVC.
[34] Quan Chen, et al. DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[35] Song Han, et al. Learning both Weights and Connections for Efficient Neural Network, 2015, NIPS.
[36] Karin Strauss, et al. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware, 2015.
[37] Margrit Betke, et al. Salient Object Subitizing, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Thu D. Nguyen, et al. ApproxHadoop: Bringing Approximations to MapReduce Frameworks, 2015, ASPLOS.
[39] Michele Magno, et al. Accelerating real-time embedded scene labeling with convolutional networks, 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).
[40] Andrew Lavin, et al. maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs, 2015, ArXiv.
[41] Ronald G. Dreslinski, et al. Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers, 2015, ASPLOS.
[42] Ross B. Girshick, et al. Fast R-CNN, 2015, ICCV.
[43] Geoffrey E. Hinton, et al. Deep Learning, 2015, Nature.
[44] Pritish Narayanan, et al. Deep Learning with Limited Numerical Precision, 2015, ICML.
[45] Yoshua Bengio, et al. Low precision arithmetic for deep learning, 2014, ICLR.
[46] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.
[47] Luca Benini, et al. A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters, 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[48] Scott A. Mahlke, et al. Input responsiveness: using canary inputs to dynamically steer approximation, 2016, PLDI.
[49] Song Han, et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[50] Joel Emer, et al. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks, 2016, ISCA.
[51] Natalie D. Enright Jerger, et al. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[52] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.
[53] Gu-Yeon Wei, et al. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[54] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.
[55] Miao Hu, et al. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[56] Natalie D. Enright Jerger, et al. Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks, 2016, ICS.
[57] Song Han, et al. Deep compression and EIE: Efficient inference engine on compressed deep neural network, 2016, 2016 IEEE Hot Chips 28 Symposium (HCS).
[58] Scott A. Mahlke, et al. Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[59] Hadi Esmaeilzadeh, et al. Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).