SynergyFlow

Neural networks (NNs) have achieved great success in a broad range of applications. Because NN-based methods are typically both computation- and memory-intensive, accelerator solutions have proven highly promising in terms of both performance and energy efficiency. Although prior accelerators can deliver high computational throughput for convolutional layers, they may suffer severe performance degradation when accommodating an entire network model, because convolutional layers and fully connected layers, as well as different NN models, have very different computation and memory-bandwidth requirements. To overcome this problem, we propose an elastic accelerator architecture, called SynergyFlow, which intrinsically supports layer-level and model-level parallelism for large-scale deep neural networks. SynergyFlow boosts resource utilization by exploiting the complementary resource demands of different layers and different NN models, and it can dynamically reconfigure itself according to workload characteristics, maintaining high performance and high resource utilization across various models. As a case study, we implement SynergyFlow on a P395-AB FPGA board. At a working frequency of 100 MHz, our implementation improves performance by 33.8% on average (up to 67.2% on AlexNet) compared with comparably provisioned prior architectures.
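The complementary-demand argument can be made concrete with a simple roofline-style estimate. The sketch below is illustrative only and is not taken from the paper: the peak-throughput and bandwidth figures, the layer operation counts, and the idealized co-scheduling model are all assumptions made for this example. It shows why a compute-bound convolutional layer run alone leaves memory bandwidth idle, a bandwidth-bound fully connected layer run alone leaves compute idle, and running the two concurrently can finish sooner than running them back to back.

```python
from dataclasses import dataclass

# Assumed accelerator budget for this sketch (hypothetical numbers).
PEAK_OPS = 200e9   # peak arithmetic throughput, ops/s
PEAK_BW = 10e9     # off-chip memory bandwidth, bytes/s

@dataclass
class Layer:
    name: str
    ops: float      # arithmetic operations required by the layer
    traffic: float  # off-chip traffic in bytes

    def runtime_alone(self) -> float:
        # Run alone, a layer is limited by whichever resource it saturates,
        # leaving the other resource partly idle.
        return max(self.ops / PEAK_OPS, self.traffic / PEAK_BW)

def runtime_coscheduled(a: Layer, b: Layer) -> float:
    # When two layers with complementary demands share the accelerator,
    # compute and bandwidth are kept busy at the same time; the pair
    # finishes when the more heavily loaded resource drains (idealized).
    return max((a.ops + b.ops) / PEAK_OPS, (a.traffic + b.traffic) / PEAK_BW)

# Hypothetical layer figures: a compute-bound CONV layer and a
# bandwidth-bound FC layer, possibly belonging to different models.
conv = Layer("conv (compute-bound)", ops=2.0e9, traffic=20e6)
fc = Layer("fc (bandwidth-bound)", ops=0.2e9, traffic=100e6)

serial = conv.runtime_alone() + fc.runtime_alone()
paired = runtime_coscheduled(conv, fc)
print(f"serial: {serial * 1e3:.1f} ms, co-scheduled: {paired * 1e3:.1f} ms, "
      f"speedup: {serial / paired:.2f}x")
```

With these assumed numbers, serial execution takes 20 ms while the co-scheduled pair takes 12 ms (about 1.67x), because neither the compute units nor the memory interface sits idle while the other layer's bottleneck drains.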
