Multi-objective Recurrent Neural Networks Optimization for the Edge - a Quantization-based Approach

The compression of deep learning models is of fundamental importance in deploying such models to edge devices. Incorporating the hardware model and application constraints during compression maximizes the benefits but ties the result to one specific case, so the compression must be automated to adapt to changes in the hardware platform and the application. Searching for the optimal compression parameters can be cast as an optimization problem. This article introduces a Multi-Objective Hardware-Aware Quantization (MOHAQ) method, which considers both hardware efficiency and inference error as objectives for mixed-precision quantization. The proposed method makes the evaluation of candidate solutions in a large search space feasible by relying on two steps. First, post-training quantization is applied for fast solution evaluation (inference-only search). Second, we propose a novel search technique named "beacon-based search" that retrains only selected solutions in the search space and uses them as beacons to estimate the effect of retraining on neighboring solutions. To evaluate the optimization potential, we chose a speech recognition model trained on the TIMIT dataset. The model is based on the Simple Recurrent Unit (SRU) because of its considerable speedup over other recurrent units. We applied our method to two platforms: SiLago and Bitfusion. Our experimental evaluations show that SRU can be compressed up to 8x by post-training quantization without any significant increase in error, and up to 12x with only a 1.5 percentage point increase in error. On SiLago, the inference-only search found solutions that achieve 80% of the maximum possible speedup and 64% of the maximum possible energy saving with a 0.5 percentage point increase in error. On Bitfusion, under a small SRAM size constraint of 2 MB, beacon-based search reduced the error overhead of the inference-only search by 4 percentage points and raised the achievable speedup to 47x relative to the Bitfusion baseline.
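As an illustration only, the sketch below shows how the two evaluation steps described above could look in code: a simple symmetric post-training quantizer for fast, inference-only scoring of a candidate per-layer bit-width assignment, and a beacon-style correction that reuses the error reduction measured at the nearest retrained beacon. The helper names (error_fn, cost_fn, beacons) and the nearest-beacon adjustment rule are assumptions made for this sketch, not the paper's exact formulation.

```python
import numpy as np

def quantize(weights, n_bits):
    """Symmetric uniform post-training quantization of one weight tensor."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(weights)) / q_max
    return np.clip(np.round(weights / scale), -q_max, q_max) * scale

def evaluate_candidate(layer_weights, bit_assignment, error_fn, cost_fn):
    """Inference-only evaluation: quantize each layer, then score both objectives.

    error_fn and cost_fn are hypothetical placeholders for the task error
    (e.g. phone error rate on TIMIT) and the hardware cost model
    (speedup/energy on the target platform)."""
    quantized = [quantize(w, b) for w, b in zip(layer_weights, bit_assignment)]
    return error_fn(quantized), cost_fn(bit_assignment)

def beacon_corrected_error(inference_only_error, bit_assignment, beacons):
    """Adjust an inference-only error using the nearest retrained beacon.

    `beacons` maps a retrained bit assignment (a tuple of per-layer bit-widths)
    to the error reduction observed after retraining it; this sketch assumes
    the same reduction transfers to nearby, un-retrained candidates."""
    nearest = min(beacons, key=lambda b: sum(abs(x - y) for x, y in zip(b, bit_assignment)))
    return inference_only_error - beacons[nearest]
```

In a multi-objective search, the pair returned by evaluate_candidate (error, hardware cost) would serve as the objective vector for each candidate bit-width assignment, with the beacon correction replacing the raw error once beacons have been retrained.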
