MoDNN: Local distributed mobile computing system for Deep Neural Network

Although Deep Neural Networks (DNNs) are ubiquitously utilized in many applications, it is generally difficult to deploy DNNs on resource-constrained devices, e.g., mobile platforms. Existing attempts mainly focus on the client-server computing paradigm or DNN model compression, which require infrastructure support or special training phases, respectively. In this work, we propose MoDNN, a local distributed mobile computing system for DNN applications. MoDNN can partition already-trained DNN models onto several mobile devices to accelerate DNN computations by alleviating device-level computing cost and memory usage. Two model partition schemes are also designed to minimize non-parallel data delivery time, including both wakeup time and transmission time. Experimental results show that when the number of worker nodes increases from 2 to 4, MoDNN accelerates DNN computation by 2.17x to 4.28x. Besides the parallel execution, the speedup also partially comes from the reduction of data delivery time, e.g., 30.02% relative to the conventional 2-D grid partition scheme.
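To make the partitioning idea concrete, the following is a minimal sketch, not MoDNN's actual scheme: it assumes the partitioner splits a layer's input feature map into horizontal strips whose sizes are proportional to each worker's relative compute capability, so that faster devices receive larger slices. The function `partition_rows` and the capability weights are hypothetical names introduced here purely for illustration.

```python
import numpy as np

def partition_rows(feature_map, capabilities):
    """Hypothetical sketch: split a layer's input feature map into
    horizontal strips, one per worker node, sized in proportion to each
    worker's relative compute capability (an assumed heuristic, not
    necessarily the exact partition scheme used in MoDNN)."""
    rows = feature_map.shape[0]
    shares = np.asarray(capabilities, dtype=float)
    shares /= shares.sum()
    # Cumulative row boundaries, rounded to whole rows.
    bounds = np.round(np.cumsum(shares) * rows).astype(int)
    starts = np.concatenate(([0], bounds[:-1]))
    return [feature_map[s:e] for s, e in zip(starts, bounds)]

# Example: a 224x224x3 input split across 3 workers of unequal speed.
fmap = np.zeros((224, 224, 3), dtype=np.float32)
parts = partition_rows(fmap, capabilities=[1.0, 2.0, 1.0])
print([p.shape for p in parts])  # strips of roughly 56, 112, and 56 rows
```

In practice, a partitioner for convolutional layers would also have to replicate a few boundary rows between neighboring strips to cover the kernel's receptive field, and exchange intermediate results between layers; such transfers are one source of the non-parallel data delivery time (wakeup plus transmission) that the partition schemes aim to minimize.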
