End-to-End Whispered Speech Recognition with Frequency-Weighted Approaches and Pseudo Whisper Pre-training

Whispering is an important mode of human speech, yet no end-to-end recognition results for it have been reported, probably due to the scarcity of available whispered speech data. In this paper, we present several approaches for end-to-end (E2E) recognition of whispered speech that account for its special acoustic characteristics and the scarcity of data. These include a frequency-weighted SpecAugment policy and a frequency-divided CNN feature extractor for better capturing the high-frequency structure of whispered speech, as well as a layer-wise transfer learning approach that pre-trains a model on normal or normal-to-whispered converted speech and then fine-tunes it on whispered speech to bridge the gap between whispered and normal speech. We achieve overall relative reductions of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus. The results indicate that, given a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
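To make the frequency-weighted SpecAugment idea concrete, the sketch below biases where frequency masks land rather than drawing mask positions uniformly, as standard SpecAugment does. The specific weighting shown (masks favor low bins so that high-frequency cues, which carry more information in whispered speech, are spared) is an illustrative assumption for this sketch, not the paper's exact policy; the function name and parameters are likewise hypothetical.

```python
import numpy as np

def freq_weighted_mask(spec, num_masks=2, max_width=8, rng=None):
    """SpecAugment-style frequency masking with non-uniform mask placement.

    spec: (frames, freq_bins) log-mel spectrogram.
    Assumption for illustration: mask start positions are drawn with
    linearly decreasing probability over frequency bins, so masks mostly
    cover low frequencies and preserve high-frequency structure.
    """
    rng = rng or np.random.default_rng(0)
    out = spec.copy()
    n_bins = spec.shape[1]
    # Linearly decreasing sampling weights over frequency bins (hypothetical).
    weights = np.linspace(1.0, 0.1, n_bins)
    probs = weights / weights.sum()
    for _ in range(num_masks):
        width = int(rng.integers(1, max_width + 1))
        start = int(rng.choice(n_bins, p=probs))
        out[:, start:start + width] = 0.0  # zero out the masked band
    return out

# Example: augment a random 100-frame, 80-bin spectrogram.
augmented = freq_weighted_mask(np.random.default_rng(1).normal(size=(100, 80)))
```

In practice the weighting curve would be tuned (or inverted) to match whichever frequency region the target speech mode degrades least.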
