Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data

New devices, improved machine learning techniques, and the availability of large free speech corpora have made speech recognition rapid and accurate. Over the last two decades, researchers and organizations have experimented extensively with new techniques and their applications in speech processing systems. Speech-command-based applications now appear in robotics, IoT, ubiquitous computing, and various human-computer interfaces. Many researchers have worked on improving the efficiency of speech-command systems using the Speech Commands dataset, but none have addressed noise in it. Noise is one of the major challenges for any speech recognition system: real-world noise is highly variable and unavoidable, and it degrades the performance of systems that have not learned to handle it. We thoroughly analyse recent trends in speech recognition and evaluate the Speech Commands dataset on several machine learning and deep learning techniques. We propose a novel technique for noise robustness that augments the training data with noise. Tested on clean data, noisy data, and locally generated data, the proposed technique achieves substantially better results than existing state-of-the-art techniques, setting a new benchmark.
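The core of such noise augmentation is mixing recorded background noise into clean training utterances at a controlled signal-to-noise ratio (SNR). A minimal sketch is shown below; the function name `mix_noise` and the per-utterance SNR parameterization are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mix_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise waveform into a clean waveform at a target SNR in dB.

    Both inputs are 1-D float arrays at the same sampling rate.
    """
    # Tile or trim the noise so it covers the full clean signal.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Scale the noise so that 10*log10(P_clean / P_noise) == snr_db.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

During training, each clean utterance would typically be mixed with a randomly chosen noise clip at a randomly sampled SNR (e.g. 0–20 dB), so the model sees many noise conditions for the same command.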
