Cycle-consistent Adversarial Networks for Non-parallel Vocal Effort Based Speaking Style Conversion

Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we propose the use of cycle-consistent adversarial networks (CycleGANs) for converting between styles that differ in vocal effort, and we focus on conversion between normal and Lombard styles as a case study of this problem. We propose a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract speech features. The CycleGAN then maps these features from utterances in the source style to the corresponding features of the target style. Finally, the mapped features are converted to a Lombard speech waveform with the PML vocoder. In subjective listening tests, the CycleGAN was compared with two other standard mapping methods used in conversion and was found to perform best in terms of both speech quality and the magnitude of the perceptual change between the two styles.
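The cycle-consistency idea behind the mapping step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the study: the function name `cycle_consistency_loss`, the L1 form of the loss (as in the original image-domain CycleGAN), the toy shift "generators", and the random 4-dimensional frames standing in for vocoder features. The point it shows is that the loss needs no frame-aligned pairs of normal and Lombard utterances, only that each style can be mapped to the other and back.

```python
import numpy as np

def cycle_consistency_loss(x_normal, y_lombard, g_n2l, g_l2n):
    """Mean L1 distance between features and their round-trip reconstructions.

    g_n2l maps normal-style features to Lombard-style features and
    g_l2n maps in the opposite direction (both hypothetical callables).
    """
    x_cycled = g_l2n(g_n2l(x_normal))    # normal -> Lombard -> normal
    y_cycled = g_n2l(g_l2n(y_lombard))   # Lombard -> normal -> Lombard
    return float(np.mean(np.abs(x_cycled - x_normal))
                 + np.mean(np.abs(y_cycled - y_lombard)))

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))  # stand-in for normal-style feature frames
y = rng.normal(size=(80, 4))   # stand-in for Lombard frames (unpaired, different length)

# Toy generators: a constant shift and its exact inverse.
forward = lambda f: f + 1.5
backward = lambda f: f - 1.5

loss_inverse = cycle_consistency_loss(x, y, forward, backward)
# Replacing the backward generator with the identity breaks the cycle.
loss_mismatch = cycle_consistency_loss(x, y, forward, lambda f: f)
```

When the two generators invert each other the loss is essentially zero, and it grows when they do not. In an actual CycleGAN this term is combined with adversarial losses that push the mapped features toward the target-style distribution.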
