Real-time meets approximate computing: An elastic CNN inference accelerator with adaptive trade-off between QoS and QoR

Due to the recent progress in deep learning and neural acceleration architectures, specialized deep neural network or convolutional neural network (CNNs) accelerators are expected to provide an energy-efficient solution for real-time vision/speech processing, recognition and a wide spectrum of approximate computing applications. In addition to their wide applicability scope, we also found that the fascinating feature of deterministic performance and high energy-efficiency, makes such deep learning (DL) accelerators ideal candidates as application-processor IPs in embedded SoCs concerned with real-time processing. However, unlike traditional accelerator designs, DL accelerators introduce a new aspect of design trade-off between real-time processing (QoS) and computation approximation (QoR) into embedded systems. This work proposes an elastic CNN acceleration architecture that automatically adapts to the hard QoS constraint by exploiting the error-resilience in typical approximate computing workloads. For the first time, the proposed design, including network tuning-and-mapping software and reconfigurable accelerator hardware, aims to reconcile the design constraint of QoS and Quality of Result (QoR), which are respectively the key concerns in real-time and approximate computing. It is shown in experiments that the proposed architecture enables the embedded system to work flexibly in an expanded operating space, significantly enhances its real-time ability, and maximizes the energy-efficiency of system within the user-specified QoS-QoR constraint through self-reconfiguration.