Towards Memory-Efficient Allocation of CNNs on Processing-in-Memory Architecture

Convolutional neural networks (CNNs) have been successfully applied in artificial intelligence systems to perform sensory processing, sequence learning, and image processing. In contrast to conventional compute-centric applications, CNNs are both computationally and memory intensive: their computational and memory demands are tightly coupled through the network weights, which incurs a significant amount of data movement, especially for high-dimensional convolutions. Emerging Processing-in-Memory (PIM) architectures alleviate this memory bottleneck by integrating processing elements and memory into a 3D-stacked design. Although such an architecture offers fast near-data processing that reduces data movement, memory remains a limiting factor of the entire system. We observe that a key unsolved challenge is how to efficiently allocate convolutions to 3D-stacked PIM so as to combine the advantages of both neural and computational processing. This paper presents MemoNet, a memory-efficient data allocation strategy for convolutional neural networks on 3D PIM architectures. MemoNet offers fine-grained parallelism that fully exploits the computational power of the PIM architecture. The objective is to capture the characteristics of neural network applications and match them to the underlying hardware resources provided by PIM, resulting in a hardware-independent design that allocates data transparently. We formulate the target problem as a dynamic programming model and present an optimal solution. To demonstrate the viability of MemoNet, we conduct a set of experiments using a variety of realistic convolutional neural network applications. Extensive evaluations show that MemoNet significantly improves performance and cache utilization compared to representative schemes.
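
The abstract does not spell out the dynamic programming formulation, so the sketch below is a hypothetical illustration of the kind of model it describes: a chain of CNN layers is split into contiguous groups, one group per 3D-stacked PIM vault, so that each group fits in vault-local memory and the feature-map traffic crossing group boundaries is minimized. The function allocate and its inputs layer_mem, fmap_size, and vault_capacity are assumptions for illustration, not the paper's actual interface.

import math

def allocate(layer_mem, fmap_size, vault_capacity):
    """Hypothetical MemoNet-style allocation sketch, not the paper's method.
    layer_mem[i]   : memory footprint of layer i (weights plus buffers)
    fmap_size[i]   : size of the feature map produced by layer i
    vault_capacity : local memory available in a single vault
    Returns (minimal inter-vault traffic, sorted list of group boundaries)."""
    n = len(layer_mem)
    # cost[j]: minimal traffic to place layers 0..j-1; prev[j]: start of the last group
    cost = [math.inf] * (n + 1)
    prev = [-1] * (n + 1)
    cost[0] = 0.0
    for j in range(1, n + 1):
        group = 0  # memory used by the candidate last group, layers i-1..j-1
        for i in range(j, 0, -1):
            group += layer_mem[i - 1]
            if group > vault_capacity:
                break  # the group no longer fits into one vault
            # traffic crossing into this group: the feature map of layer i-2, if any
            cross = fmap_size[i - 2] if i >= 2 else 0.0
            if cost[i - 1] + cross < cost[j]:
                cost[j] = cost[i - 1] + cross
                prev[j] = i - 1
    if math.isinf(cost[n]):
        raise ValueError("a single layer exceeds the vault capacity")
    cuts, j = [], n  # walk the predecessor chain to recover the group boundaries
    while j > 0:
        cuts.append(prev[j])
        j = prev[j]
    return cost[n], sorted(cuts[:-1])  # drop the leading 0, keep interior cuts

For example, allocate([4, 6, 3, 5], [8, 2, 7, 1], vault_capacity=10) returns (2.0, [2]): layers 0-1 share one vault, layers 2-3 another, and only the 2-unit feature map of layer 1 crosses vaults.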
