Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network

We propose a hybrid network-based learning framework for speaker-adaptive vocal emotion conversion, evaluated on three datasets in different languages: EmoDB (German), IITKGP-SESC (Telugu), and SAVEE (English). The optimized learning model is distinctive in its ability to synthesize emotional speech of acceptable perceptual quality while preserving speaker characteristics. The multilingual model is particularly beneficial in scenarios where emotional training data from a specific target speaker are sparse. The proposed model uses speaker-normalized mel-generalized cepstral coefficients for spectral training, with data adaptation using seed data from the target speaker. The fundamental frequency (F0) is transformed with a wavelet synchrosqueezed transform prior to mapping to obtain a sharpened time-frequency representation, and a feedforward artificial neural network trained with particle swarm optimization is used for F0 mapping. Additionally, static intensity modification is performed for each test utterance. With this framework, we captured the spectral and pitch-contour variability of emotional expression better than the other state-of-the-art methods considered in this study. Across datasets, the proposed framework obtained an average mel-cepstral distortion (MCD) of 4.98 and root mean square error of F0 (RMSE-F0) of 10.67 in objective evaluations, and an average comparative mean opinion score (CMOS) of 3.57 and speaker similarity score of 3.70 in subjective evaluations. In particular, the best MCD of 4.09 (EmoDB, happiness) and RMSE-F0 of 9.00 (EmoDB, anger) were obtained, along with a maximum CMOS of 3.7 and speaker similarity of 4.6, highlighting the effectiveness of the hybrid network model.
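To illustrate the F0 preprocessing step, the sketch below computes a wavelet synchrosqueezed transform of a synthetic F0 contour. The paper does not name a toolbox; the ssqueezepy package is one open-source option and is an assumption here, as are the frame shift and the stand-in contour.

```python
import numpy as np
from ssqueezepy import ssq_cwt  # assumed tooling, not specified by the paper

# Stand-in F0 contour (Hz) at a 5 ms frame shift: a slow rise with vibrato.
t = np.arange(0.0, 2.0, 0.005)
f0 = 120.0 + 40.0 * t + 5.0 * np.sin(2 * np.pi * 6.0 * t)

# Tx: synchrosqueezed transform; Wx: the plain CWT for comparison.
Tx, Wx, *_ = ssq_cwt(f0)

# |Tx| concentrates energy along the instantaneous-frequency ridge,
# yielding the sharpened time-frequency representation used before mapping.
print(Tx.shape, Wx.shape)
```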
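The hybrid F0 model can be sketched as a small feedforward network whose weights are found by particle swarm optimization rather than backpropagation. The following is a minimal sketch, assuming a single ReLU hidden layer, standard PSO constants, and synthetic stand-in data for aligned (neutral, emotional) F0 feature pairs; none of these choices are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HID, N_OUT = 10, 16, 10  # hypothetical layer sizes

def unpack(w):
    """Slice a flat particle vector into network weights and biases."""
    i = 0
    W1 = w[i:i + N_IN * N_HID].reshape(N_IN, N_HID); i += N_IN * N_HID
    b1 = w[i:i + N_HID]; i += N_HID
    W2 = w[i:i + N_HID * N_OUT].reshape(N_HID, N_OUT); i += N_HID * N_OUT
    b2 = w[i:i + N_OUT]
    return W1, b1, W2, b2

def forward(w, X):
    W1, b1, W2, b2 = unpack(w)
    h = np.maximum(0.0, X @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2                # linear output (target F0 features)

def mse(w, X, Y):
    return float(np.mean((forward(w, X) - Y) ** 2))

# Synthetic stand-in for aligned (neutral, emotional) F0 feature pairs.
X = rng.normal(size=(200, N_IN))
Y = 1.3 * X + 0.5 + 0.05 * rng.normal(size=(200, N_OUT))

dim = N_IN * N_HID + N_HID + N_HID * N_OUT + N_OUT
n_particles, iters = 30, 200
w_inertia, c1, c2 = 0.72, 1.49, 1.49  # commonly used PSO constants

pos = rng.normal(scale=0.1, size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_cost = pos.copy(), np.array([mse(p, X, Y) for p in pos])
gbest = pbest[pbest_cost.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, 1))
    # Velocity update: inertia + pull toward personal and global bests.
    vel = w_inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    cost = np.array([mse(p, X, Y) for p in pos])
    improved = cost < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], cost[improved]
    gbest = pbest[pbest_cost.argmin()].copy()

print("final training MSE:", pbest_cost.min())
```

Because PSO only needs fitness evaluations, the same loop works for any network size; the trade-off is that it scales poorly as the flattened weight vector grows, which is why it suits the compact F0-mapping networks described here.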
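For the objective scores quoted above, the standard definitions of MCD and RMSE-F0 suffice. A minimal sketch follows, assuming the mel-cepstral sequences are already time-aligned (e.g., by dynamic time warping) and that unvoiced frames carry F0 = 0; vocoder feature extraction is omitted.

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_conv):
    """Frame-averaged MCD in dB between two aligned mel-cepstral
    sequences of shape (frames, dims); the 0th (energy) coefficient
    is conventionally excluded."""
    diff = c_ref[:, 1:] - c_conv[:, 1:]
    return float(np.mean((10.0 / np.log(10.0))
                         * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def rmse_f0(f0_ref, f0_conv):
    """RMSE (Hz) between aligned F0 contours, over frames voiced in both."""
    voiced = (f0_ref > 0) & (f0_conv > 0)
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_conv[voiced]) ** 2)))
```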
