WHALETRANS: E2E WHisper to nAturaL spEech conversion using modified TRANSformer network

In this article, we investigate a whispered-to-natural speech conversion method using a sequence-to-sequence generation approach and propose a modified transformer architecture. We investigate different kinds of features, such as mel-frequency cepstral coefficients (MFCCs) and smoothed spectral features. The network is trained end-to-end (E2E) in a supervised manner. We investigate the effectiveness of an embedded auxiliary decoder placed after N encoder sub-layers, which is trained with a frame-level objective function to identify source phoneme labels. At test time, we predict target audio features and generate audio from them. We evaluate on the standard wTIMIT and CHAINS datasets, and report the word error rate (WER) obtained with an automatic speech recognition (ASR) system as well as BLEU scores. In addition, we characterize the spectral shape of the output speech signal by measuring its frame-level formant distributions with respect to the reference speech signal. In relation to this aspect, we find that the formant probability distribution of the whispered-to-natural converted speech is closer to the ground-truth distribution. To the authors' best knowledge, this is the first time a transformer with an auxiliary decoder has been applied to whispered-to-natural speech conversion. [This PDF is TASLP submission draft version 1.0, 14 April 2020.]
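The abstract describes a multi-task setup: a main branch maps whispered features to natural-speech features, while an auxiliary decoder attached after N encoder sub-layers classifies source phonemes at the frame level. The following is a minimal NumPy sketch of how such a combined objective could be computed; the layer stand-ins, the weighting factor `lam`, and all variable names are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, P = 50, 40, 45  # frames, feature dim (e.g. MFCCs), phoneme classes

def encoder_sublayer(x, w):
    """Stand-in for one transformer encoder sub-layer (linear map + ReLU)."""
    return np.maximum(x @ w, 0.0)

def frame_phoneme_loss(hidden, labels, w_cls):
    """Frame-level cross-entropy from the auxiliary decoder's classifier."""
    logits = hidden @ w_cls                       # (T, P)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Source whisper features pass through N encoder sub-layers.
x = rng.standard_normal((T, D))
N = 3
weights = [rng.standard_normal((D, D)) * 0.1 for _ in range(N)]
h = x
for w in weights:
    h = encoder_sublayer(h, w)

# Auxiliary branch: identify source phoneme labels per frame.
labels = rng.integers(0, P, size=T)
w_cls = rng.standard_normal((D, P)) * 0.1
aux_loss = frame_phoneme_loss(h, labels, w_cls)

# Main branch: predict target natural-speech features (MSE stand-in).
w_out = rng.standard_normal((D, D)) * 0.1
target = rng.standard_normal((T, D))
main_loss = ((h @ w_out - target) ** 2).mean()

lam = 0.3  # assumed weight on the auxiliary term
total_loss = main_loss + lam * aux_loss
print(total_loss > 0.0)
```

In a real training loop both terms would be backpropagated jointly, so the encoder representation after N sub-layers is shaped to be both phoneme-discriminative and useful for the downstream feature prediction.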
