Improving Automatic Speech Recognition Utilizing Audio-codecs for Data Augmentation

Training end-to-end automatic speech recognition models requires a large amount of labeled speech data, a requirement that is difficult to meet for low-resource languages. In contrast to the commonly used feature-level data augmentation, we propose to expand the training set at the data level by re-encoding the audio with different codecs. The augmentation method applies different audio codecs with varied bit rates, sampling rates, and bit depths. These changes introduce variation into the input data without drastically degrading audio quality: the audio remains intelligible to humans, and any feature extraction can still be performed afterwards. To demonstrate the general applicability of the proposed augmentation technique, we evaluated it with an end-to-end automatic speech recognition architecture on four languages. Applying the method to the Amharic, Dutch, Slovenian, and Turkish datasets yields an average character error rate (CER) improvement of 1.57 without integrating language models; compared with the baseline results, the CER improves by 2.78, 1.25, 1.21, and 1.05 for the respective languages. On the Amharic dataset, we additionally reach a syllable error rate reduction of 6.12 compared to the baseline.
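To make the procedure concrete, the following Python sketch illustrates codec-level augmentation, assuming the ffmpeg CLI is available on the system. The codec/bit-rate/sample-rate grid, the `augment` helper, and the file-naming scheme are illustrative assumptions rather than the paper's exact configuration; the idea is simply to pass each utterance through a lossy codec or a reduced-resolution format and decode it back to a canonical PCM format before feature extraction.

```python
# Minimal sketch of codec-based data augmentation via the ffmpeg CLI.
# The variant grid below is a hypothetical example, not the paper's setup.
import subprocess
from pathlib import Path

# Hypothetical augmentation grid: (output extension, extra ffmpeg flags).
CODEC_VARIANTS = [
    ("opus", ["-c:a", "libopus", "-b:a", "24k"]),     # lossy codec, low bit rate
    ("mp3",  ["-c:a", "libmp3lame", "-b:a", "64k"]),  # lossy codec, mid bit rate
    ("wav",  ["-c:a", "pcm_u8", "-ar", "8000"]),      # 8-bit depth, 8 kHz sampling
]

def augment(wav_path: Path, out_dir: Path) -> list[Path]:
    """Re-encode one utterance with each codec variant, then decode back
    to 16 kHz / 16-bit WAV so the ASR front end sees a uniform format."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for ext, flags in CODEC_VARIANTS:
        coded = out_dir / f"{wav_path.stem}.{ext}"
        restored = out_dir / f"{wav_path.stem}_{ext}.wav"
        # Encode with the lossy codec / reduced-resolution format.
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(wav_path), *flags, str(coded)],
            check=True, capture_output=True,
        )
        # Decode back to canonical 16 kHz / 16-bit PCM for feature extraction.
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(coded),
             "-ar", "16000", "-sample_fmt", "s16", str(restored)],
            check=True, capture_output=True,
        )
        outputs.append(restored)
    return outputs
```

In practice, the restored WAV files would be added to the training manifest alongside the originals, so the model sees several codec-distorted versions of every utterance while the transcript labels stay unchanged.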
