Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention

Whispering is a special pronunciation style in which the vocal cords do not vibrate. Compared with voiced speech, whispered speech is noise-like because it lacks a fundamental frequency, and its energy is approximately 20 dB lower. Converting whispered speech into normal speech is an effective way to improve speech quality and/or intelligibility. In this paper, we propose a whisper-to-normal speech conversion method based on a sequence-to-sequence framework combined with an auditory attention mechanism. The proposed method does not require time alignment before conversion training, which makes it more applicable to real scenarios. In addition, the fundamental frequency is estimated from the mel-frequency cepstral coefficients produced by the proposed sequence-to-sequence framework. The voiced speech converted by the proposed method has an appropriate length, which the sequence-to-sequence model determines adaptively from the source whispered speech. Experimental results show that the proposed sequence-to-sequence whisper-to-normal speech conversion method outperforms conventional DTW-based methods.
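The core of the attention mechanism described above is a per-frame weighting: at each decoding step, the decoder scores every encoder frame of the whispered input, normalizes the scores into weights, and forms a context vector as the weighted sum. The sketch below illustrates this with simple dot-product scoring in NumPy; the function name, dimensions, and scoring rule are illustrative assumptions, not the paper's actual "auditory attention" formulation.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Minimal dot-product attention sketch.

    decoder_state:  (d,)  current decoder hidden state
    encoder_states: (T, d) hidden states for T whispered-speech frames

    Returns the context vector (d,) and the attention weights (T,).
    """
    scores = encoder_states @ decoder_state       # similarity per frame, (T,)
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()                      # weights sum to 1 over frames
    context = weights @ encoder_states            # weighted sum of frames, (d,)
    return context, weights

# Example: 5 encoder frames with 4-dimensional hidden states.
rng = np.random.default_rng(0)
enc = rng.standard_normal((5, 4))
dec = rng.standard_normal(4)
ctx, w = attention_context(dec, enc)
```

Because the context vector is recomputed at every decoding step, the decoder can emit an output sequence whose length differs from the input's, which is why no prior DTW-style time alignment is needed.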
