Neutral to Lombard Speech Conversion with Deep Learning

In this paper, we propose several approaches for neutral-to-Lombard speech conversion. In particular, we study the influence of different recurrent neural network architectures, whose main hyper-parameters are carefully selected using a bandit-based approach. We also apply the Continuous Wavelet Transform (CWT) as a multi-resolution analysis framework to better model the temporal dependencies of the selected features. The conversion results are validated by means of objective evaluations, which highlight in particular the benefit of the wavelet transform for the learning process.
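To illustrate the multi-resolution idea behind the CWT, the sketch below decomposes a toy F0-like contour at several temporal scales using a Ricker ("Mexican hat") wavelet implemented with NumPy alone. This is a generic sketch, not the paper's actual pipeline: the wavelet choice, the dyadic scale set, and the synthetic contour are all illustrative assumptions.

```python
import numpy as np

def ricker(points, a):
    # Ricker ("Mexican hat") wavelet of width `a`, sampled on `points` samples
    t = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1 - (t / a) ** 2) * np.exp(-(t ** 2) / (2 * a ** 2))

def cwt(signal, scales):
    # CWT via convolution at each scale: each row of the output captures
    # variations of the signal at one temporal resolution.
    out = np.empty((len(scales), len(signal)))
    for i, a in enumerate(scales):
        n = min(10 * int(a), len(signal))
        out[i] = np.convolve(signal, ricker(n, a), mode="same")
    return out

# Toy "F0 contour": a slow phrase-level trend plus faster syllable-level movement
t = np.linspace(0.0, 1.0, 200)
f0 = 120 + 20 * np.sin(2 * np.pi * 1 * t) + 5 * np.sin(2 * np.pi * 10 * t)

scales = 2.0 ** np.arange(1, 6)            # dyadic scales, fine to coarse
coeffs = cwt(f0 - f0.mean(), scales)       # mean-removed, as is common for F0
print(coeffs.shape)                        # (5, 200): one row per resolution
```

In a conversion setting, each scale (row) can then be modeled or predicted separately, letting the network treat slow phrase-level F0 movements and fast local movements as distinct targets.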
