QS-NAS: Optimally Quantized Scaled Architecture Search to Enable Efficient On-Device Micro-AI

Because of their simple hardware requirements, low-bitwidth neural networks (NNs) have gained significant attention in recent years and have been extensively employed in state-of-the-art devices that seek both efficiency and performance. Research has shown that scaled-up low-bitwidth NNs can achieve accuracy on par with their full-precision counterparts. As a result, there is a trade-off between quantization ($q$) and scaling ($s$) of NNs to maintain accuracy. To capture that trade-off, in this paper we propose QS-NAS, a systematic approach to explore the quantization and scaling factors for a NN architecture that satisfy a targeted accuracy level and result in the least energy consumption per inference when the NN is deployed to hardware, an FPGA in this work. We first approximate the accuracy of a NN using polynomial regression over experiments spanning a range of $q$ and $s$. Then, we design a hardware architecture that is scalable in its number $P$ of processing engines (PEs) and its number $M$ of multipliers per PE, and infer that the configuration of the most energy-efficient hardware, as well as its energy per inference for a NN $\langle q,\,s\rangle$, are in turn functions of $q$ and $s$. Evaluating the NNs with various $q$ and $s$ on our hardware, we approximate the energy consumption using another polynomial regression. Given the two approximators, we obtain the pair $\langle q,\,s\rangle$ that minimizes energy for a given targeted accuracy. The method was evaluated with VGG-like and MobileNet-192 architectures trained on the SVHN, CIFAR-10, and ImageNet datasets, and the optimized models were deployed to Xilinx FPGAs for fully on-chip processing. The implementation results outperform related work in terms of energy efficiency and/or power consumption while achieving similar or higher accuracy. The proposed optimization method is fast, simple, and scalable to emerging technologies. Moreover, it can be used on top of other AutoML frameworks to maximize the efficiency of running artificial intelligence on edge devices.
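To make the selection step concrete, below is a minimal sketch of the final optimization the abstract describes: fit one polynomial regression for accuracy and one for energy as functions of $q$ and $s$, then pick the cheapest $\langle q,\,s\rangle$ that meets a target accuracy. The degree-2 polynomial form, the sample measurements, the candidate grid, and the target value are all hypothetical placeholders, not figures from the paper.

```python
# Sketch of QS-NAS's two-approximator selection step (illustrative values only).
import numpy as np

def fit_poly2(q, s, y):
    """Least-squares fit of a degree-2 polynomial y ~ f(q, s).

    The degree-2 basis [1, q, s, qs, q^2, s^2] is an assumption; the paper
    only states that polynomial regression is used.
    """
    A = np.column_stack([np.ones_like(q), q, s, q * s, q**2, s**2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def eval_poly2(coef, q, s):
    A = np.column_stack([np.ones_like(q), q, s, q * s, q**2, s**2])
    return A @ coef

# Hypothetical measurements per <q, s> configuration:
# bitwidth q, width-scaling s, validation accuracy (%), energy per inference (mJ).
q      = np.array([1,   1,   2,   2,   4,   4,   8,    8], dtype=float)
s      = np.array([1,   2,   1,   2,   1,   2,   1,    2], dtype=float)
acc    = np.array([78,  85,  84,  89,  88,  91,  90,  92], dtype=float)
energy = np.array([0.8, 2.9, 1.3, 4.6, 2.4, 8.1, 5.0, 17.0], dtype=float)

acc_coef = fit_poly2(q, s, acc)      # accuracy approximator
en_coef  = fit_poly2(q, s, energy)   # energy approximator

# Sweep a candidate grid and keep the cheapest <q, s> meeting the target accuracy.
target_acc = 89.0
qq, ss = np.meshgrid(np.array([1, 2, 4, 8], dtype=float), np.linspace(1, 2, 11))
qf, sf = qq.ravel(), ss.ravel()
feasible = eval_poly2(acc_coef, qf, sf) >= target_acc
assert feasible.any(), "no <q, s> candidate meets the target accuracy"
best = np.argmin(np.where(feasible, eval_poly2(en_coef, qf, sf), np.inf))
print(f"chosen q={qf[best]:.0f}, s={sf[best]:.2f}")
```

In this framing the expensive work (training NNs and profiling the hardware) happens only at the sampled $\langle q,\,s\rangle$ points; the constrained minimization itself runs over cheap closed-form surrogates, which is what makes the search fast.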