Exploiting Hybrid Models of Tensor-Train Networks For Spoken Command Recognition

This work aims to design a low-complexity spoken command recognition (SCR) system by considering different trade-offs between the number of model parameters and classification accuracy. More specifically, we exploit a deep hybrid architecture of a tensor-train (TT) network to build an end-to-end SCR pipeline. Our command recognition system, namely CNN+(TT-DNN), is composed of convolutional layers at the bottom for spectral feature extraction and TT layers at the top for command classification. Compared with a traditional end-to-end CNN baseline for SCR, our proposed CNN+(TT-DNN) model replaces the fully connected (FC) layers with TT layers, which substantially reduces the number of model parameters while maintaining the baseline performance of the CNN model. We initialize the CNN+(TT-DNN) model either randomly or from a well-trained CNN+DNN, and assess the CNN+(TT-DNN) models on the Google Speech Command Dataset. Our experimental results show that the proposed CNN+(TT-DNN) model attains a competitive accuracy of 96.31% with 4 times fewer model parameters than the CNN model. Furthermore, the CNN+(TT-DNN) model reaches an accuracy of 97.2% when the number of parameters is increased.
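To illustrate why replacing an FC layer with a TT layer shrinks the parameter count, the following minimal sketch compares the two under the standard TT-matrix format, in which the weight matrix is factorized into cores of shape (r_{k-1}, m_k, n_k, r_k) with boundary ranks equal to 1. The mode sizes and TT-ranks below are hypothetical values chosen only for illustration; they are not the configuration used in this work, and biases are ignored.

```python
from math import prod


def fc_params(in_modes, out_modes):
    """Weight count of a dense layer mapping prod(in_modes) -> prod(out_modes)."""
    return prod(in_modes) * prod(out_modes)


def tt_params(in_modes, out_modes, ranks):
    """Weight count of the same layer stored as TT cores.

    Core k has shape (ranks[k], in_modes[k], out_modes[k], ranks[k + 1]).
    """
    if len(in_modes) != len(out_modes):
        raise ValueError("in_modes and out_modes must have the same length")
    if len(ranks) != len(in_modes) + 1:
        raise ValueError("ranks must have one more entry than the mode lists")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT-ranks must be 1")
    return sum(
        ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
        for k in range(len(in_modes))
    )


# Hypothetical factorization (assumption, not taken from the paper):
# a 1024 -> 256 layer with 1024 = 4*8*8*4 and 256 = 4*4*4*4.
in_modes = (4, 8, 8, 4)
out_modes = (4, 4, 4, 4)
ranks = (1, 2, 2, 2, 1)

dense = fc_params(in_modes, out_modes)
tt = tt_params(in_modes, out_modes, ranks)
print(f"FC weights: {dense}, TT weights: {tt}, ratio: {dense / tt:.1f}x")
```

The saving on a single layer grows with the layer size and shrinks as the TT-ranks increase, which is the parameter-versus-accuracy trade-off the abstract refers to. The overall model reduction is smaller than the per-layer ratio because the convolutional layers are left unchanged.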
