DRMap: A Generic DRAM Data Mapping Policy for Energy-Efficient Processing of Convolutional Neural Networks

Many convolutional neural network (CNN) accelerators face performance and energy-efficiency challenges that are critical for embedded implementations, due to high DRAM access latency and energy. Recently, DRAM architectures have been proposed that exploit subarray-level parallelism to reduce access latency. Towards this, we present a design space exploration methodology to study the latency and energy of different mapping policies on different DRAM architectures, and to identify the Pareto-optimal design choices. The results show that energy-efficient DRAM accesses can be achieved by a mapping policy that prioritizes, in order, maximizing row-buffer hits, bank-level parallelism, and subarray-level parallelism.
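To illustrate the idea, the following is a minimal Python sketch of such an ordered mapping policy; it is not the paper's exact DRMap implementation, and the DRAM geometry parameters (columns per row, banks, subarrays) are illustrative assumptions only.

```python
# A minimal sketch of a DRAM data-mapping policy that prioritizes, in order:
#   1) row-buffer hits        -> fill all columns of one DRAM row first
#   2) bank-level parallelism -> then move to the next bank
#   3) subarray-level parallelism -> then the next subarray
# All geometry parameters below are assumed for illustration.

COLS_PER_ROW = 1024   # columns (bursts) per DRAM row  (assumed)
BANKS        = 8      # banks per chip                  (assumed)
SUBARRAYS    = 4      # subarrays per bank              (assumed)

def map_block(block_idx: int):
    """Map a linear data-block index to a (row, subarray, bank, column) address."""
    column = block_idx % COLS_PER_ROW
    rest = block_idx // COLS_PER_ROW
    bank = rest % BANKS
    rest //= BANKS
    subarray = rest % SUBARRAYS
    row = rest // SUBARRAYS
    return row, subarray, bank, column

if __name__ == "__main__":
    # Consecutive blocks stay in the same open row (row-buffer hits);
    # after a row fills, accesses rotate across banks, then across subarrays.
    for i in (0, 1, COLS_PER_ROW, COLS_PER_ROW * BANKS):
        print(i, map_block(i))
```

Under these assumptions, consecutive accesses hit the open row buffer; once a row is exhausted, requests spread over banks before subarrays, matching the ordered priorities described above.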
