A 47.4µJ/epoch Trainable Deep Convolutional Neural Network Accelerator for In-Situ Personalization on Smart Devices

A scalable deep learning accelerator supporting both inference and training is implemented for on-device personalization of deep convolutional neural networks (CNNs). It consists of three processor cores, each operating with a distinct energy-efficient dataflow suited to a different type of computation in CNN training. Two cores perform forward and backward propagation in convolutional layers and use a masking scheme that reduces the intermediate data stored for training by 88.3%. The third core executes the weight update process in convolutional layers and the inner product computation in fully connected layers with a novel large-window dataflow. The system enables an 8-bit fixed-point datapath with lossless training and consumes $47.4\,\mu\mathrm{J}/\mathrm{epoch}$ for a customized deep CNN model.
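To illustrate why a masking scheme can shrink the intermediate data kept for training, the following is a minimal sketch of one common idea: the backward pass of a ReLU needs only the sign of its forward input, so a 1-bit mask can stand in for the stored activation. This is an assumption made for illustration and not necessarily the scheme used in the accelerator; the function names (`relu_forward`, `relu_backward`, `storage_saving`) are hypothetical. The 87.5% saving it computes follows only from comparing 1 bit with 8 bits and is not the 88.3% figure reported above.

```python
# Hypothetical sketch: keep a 1-bit ReLU mask instead of the 8-bit activation.
# This is an illustrative assumption, not the accelerator's documented scheme.

def relu_forward(x):
    """Return ReLU outputs and a 1-bit mask (1 where the input was positive)."""
    mask = [1 if v > 0 else 0 for v in x]
    out = [v if m else 0 for v, m in zip(x, mask)]
    return out, mask

def relu_backward(grad_out, mask):
    """Pass gradients through only where the forward input was positive."""
    return [g if m else 0 for g, m in zip(grad_out, mask)]

def storage_saving(bits_per_activation=8, bits_per_mask=1):
    """Fraction of storage saved by keeping the mask instead of the value."""
    return 1 - bits_per_mask / bits_per_activation

x = [-3, 5, 0, 7]                       # example integer (fixed-point) inputs
out, mask = relu_forward(x)             # forward pass keeps only the mask
grad_in = relu_backward([1, 2, 3, 4], mask)
saving = storage_saving()               # 1 bit kept per 8-bit activation
```

In this sketch the mask alone is enough to route gradients through the ReLU. Layers whose backward computation needs the actual activation values, such as the weight update of a convolution, would still need those values from elsewhere.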