A Novel Attention-Guided Generative Adversarial Network for Whisper-to-Normal Speech Conversion

Whispered speech is a special mode of pronunciation produced without vocal cord vibration. Whispered speech contains no fundamental frequency, and its energy is about 20 dB lower than that of normal speech. Converting whispered speech into normal speech can improve speech quality and intelligibility. In this paper, a novel attention-guided generative adversarial network model incorporating an autoencoder, a Siamese neural network, and an identity mapping loss function for whisper-to-normal speech conversion (AGANW2SC) is proposed. The proposed method avoids the challenge of estimating the fundamental frequency of the normal voiced speech converted from whispered speech. Moreover, the proposed model is more amenable to practical applications because it does not require aligned speech features for training. Experimental results demonstrate that the proposed AGANW2SC achieves improved speech quality and intelligibility compared with dynamic-time-warping-based methods.
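The abstract names three loss components for the generator: an adversarial term, a Siamese-network term (in the style of TraVeLGAN's transformation-vector loss), and an identity mapping term that discourages the generator from altering speech that is already normal. The following is a minimal NumPy sketch of how such terms could be combined; it is not the authors' implementation, and all function names, variable names, and loss weights are illustrative assumptions.

```python
import numpy as np

def siamese_travel_loss(s_x1, s_x2, s_gx1, s_gx2):
    """TraVeLGAN-style term (assumed form): the difference between two
    inputs' Siamese embeddings (the "transformation vector") should be
    preserved between their converted outputs' embeddings."""
    t_in = s_x1 - s_x2      # transformation vector between whispered inputs
    t_out = s_gx1 - s_gx2   # transformation vector between converted outputs
    return np.mean((t_in - t_out) ** 2)

def identity_mapping_loss(x_normal, g_x_normal):
    """Identity mapping term: passing already-normal speech features
    through the generator should leave them (nearly) unchanged."""
    return np.mean(np.abs(x_normal - g_x_normal))

def total_generator_loss(adv_loss, travel_loss, ident_loss,
                         lam_travel=10.0, lam_ident=5.0):
    """Weighted sum of the three terms; the weights are assumed
    hyperparameters, not values from the paper."""
    return adv_loss + lam_travel * travel_loss + lam_ident * ident_loss
```

For example, if the generator perfectly preserves the transformation vector and maps normal speech to itself, the Siamese and identity terms vanish and only the adversarial term remains.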
