MOHAQ: Multi-Objective Hardware-Aware Quantization of recurrent neural networks

Compressing deep learning models is essential for deploying them on edge devices. With optimization algorithms, the selection of compression parameters can be automated to follow changes in the hardware platform and the application. This article introduces a Multi-Objective Hardware-Aware Quantization (MOHAQ) method, which treats hardware efficiency and inference error as joint objectives for mixed-precision quantization. The method makes the evaluation of candidate solutions in a large search space feasible through two steps. First, post-training quantization is applied for fast solution evaluation (inference-only search). Second, we propose a "beacon-based search" that retrains only selected solutions and uses them as beacons to predict the effect of retraining on other solutions. We use a speech recognition model based on the Simple Recurrent Unit (SRU), trained on the TIMIT dataset, and apply our method to the SiLago and Bitfusion platforms. Experimental evaluations show that the SRU model can be compressed by up to 8x with post-training quantization without a significant increase in error. On SiLago, the search found solutions that achieve 97% and 86% of the maximum possible speedup and energy saving, respectively, with only a minor increase in error. On Bitfusion, the beacon-based search reduced the error increase of the inference-only search by up to 4.9 percentage points.

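To make the two-step idea concrete, below is a minimal, self-contained sketch of an inference-only search followed by a beacon-based correction. Everything in it is an illustrative assumption rather than the paper's implementation: the layer names, the placeholder error and hardware-cost models, the choice of beacons, and the nearest-beacon (Hamming-distance) correction rule are all hypothetical stand-ins for the real quantized-model evaluation, retraining, and platform cost models.

```python
# Minimal sketch of the two-step multi-objective mixed-precision search.
# All models (ptq_error, hw_cost, retrain_error) are placeholders, not the paper's code.
import itertools
import random

LAYERS = ["sru_1", "sru_2", "fc_out"]   # hypothetical layer names
BITWIDTHS = [2, 4, 8]                   # candidate per-layer weight bit-widths

def ptq_error(solution):
    """Placeholder for post-training-quantization error (inference-only step).
    In the real method this runs the quantized model on a validation set."""
    return sum(1.0 / b for b in solution) + random.uniform(0.0, 0.05)

def hw_cost(solution):
    """Placeholder hardware objective (latency/energy proxy): wider operands
    are assumed to cost more on the target accelerator."""
    return float(sum(solution))

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (both minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(evaluated):
    """Keep the non-dominated (solution, objectives) pairs."""
    return [(s, f) for s, f in evaluated
            if not any(dominates(g, f) for _, g in evaluated)]

# Step 1: inference-only search over per-layer bit-width assignments.
candidates = list(itertools.product(BITWIDTHS, repeat=len(LAYERS)))
evaluated = [(s, (ptq_error(s), hw_cost(s))) for s in candidates]
front = pareto_front(evaluated)

# Step 2: beacon-based refinement. Retrain only a few front members ("beacons")
# and use their observed error change to correct the PTQ error of other solutions.
def retrain_error(solution):
    """Placeholder for error after quantization-aware retraining."""
    return 0.7 * ptq_error(solution)

beacons = random.sample(front, k=min(2, len(front)))
corrections = {s: retrain_error(s) - f[0] for s, f in beacons}

def corrected_error(solution, ptq_err):
    """Apply the correction of the closest beacon; Hamming distance is just one
    plausible notion of 'closest' and may differ from the paper's rule."""
    nearest = min(corrections,
                  key=lambda b: sum(x != y for x, y in zip(b, solution)))
    return ptq_err + corrections[nearest]

refined = [(s, (corrected_error(s, f[0]), f[1])) for s, f in evaluated]
print(pareto_front(refined)[:5])
```

In practice the search space is far too large to enumerate, so a multi-objective genetic algorithm such as NSGA-II would replace the exhaustive `itertools.product` loop, with the same two evaluation modes plugged in as objective functions.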