STRAIGHT-Based Emotion Conversion Using Quadratic Multivariate Polynomial

Speech is the most natural mode of human communication and the easiest way of expressing emotion. Emotional content is carried by features such as the F0 contour, intensity, speaking rate, and voice quality; this group of features is called prosody, and it is generally modified through pitch and time scaling. Unlike voice conversion, where spectral transformation is the main concern, emotional speech conversion is more sensitive to prosody. Several techniques, both linear and nonlinear, have been used to transform speech. Our hypothesis is that the quality of emotional speech conversion can be improved by estimating a nonlinear relationship between neutral and emotional speech feature vectors. In this work, a quadratic multivariate polynomial (QMP) is explored for transforming neutral speech into emotional target speech. Both subjective and objective analyses were carried out to evaluate the transformed speech, using the comparison mean opinion score (CMOS), mean opinion score (MOS), identification rate, root-mean-square error, and Mahalanobis distance. For the Toronto emotional database, the CMOS analysis indicates that the transformed speech can be partly perceived as the target emotion, except for the neutral-to-sad conversion; the MOS scores and spectrograms indicate good quality of the transformed speech. For the German database, except for the neutral-to-boredom conversion, the CMOS of the proposed technique is better than that of the gross and initial-middle-final methods but lower than that of the syllable-level method. However, the QMP technique is simple, easy to implement, produces good-quality transformed speech, and estimates the transformation function from a limited number of training utterances.
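To make the mapping concrete, the sketch below shows how a quadratic multivariate polynomial transformation between time-aligned neutral and emotional feature frames could be estimated by least squares. This is a minimal illustration under stated assumptions, not the paper's exact implementation: the ridge term, the 8-dimensional random stand-in features, and the function names (quadratic_expand, fit_qmp, convert) are assumptions; real features would come from STRAIGHT analysis of parallel neutral/emotional utterances followed by frame alignment (e.g., DTW).

```python
import numpy as np

def quadratic_expand(X):
    """Expand each row x into [1, x, all pairwise products x_i * x_j with i <= j]."""
    n, d = X.shape
    iu, ju = np.triu_indices(d)              # index pairs (i, j), i <= j
    quad = X[:, iu] * X[:, ju]               # quadratic terms, shape (n, d*(d+1)/2)
    return np.hstack([np.ones((n, 1)), X, quad])

def fit_qmp(X_neutral, Y_emotional, ridge=1e-3):
    """Least-squares estimate of W in phi(x_neutral) @ W ~ y_emotional.

    X_neutral, Y_emotional: time-aligned frame-level feature matrices (n_frames x dim).
    The small ridge term (an assumption) keeps the normal equations well conditioned.
    """
    Phi = quadratic_expand(X_neutral)
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    B = Phi.T @ Y_emotional
    return np.linalg.solve(A, B)             # W, shape (expanded_dim x target_dim)

def convert(X_neutral, W):
    """Map neutral frames to predicted emotional-target frames."""
    return quadratic_expand(X_neutral) @ W

# Toy usage with random stand-in data; real inputs would be STRAIGHT-derived,
# DTW-aligned neutral/emotional feature pairs.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))            # hypothetical 8-dim neutral frames
Y = rng.standard_normal((500, 8))            # aligned emotional target frames
W = fit_qmp(X, Y)
Y_hat = convert(X, W)
rmse = np.sqrt(np.mean((Y - Y_hat) ** 2))    # objective score akin to the paper's RMSE
```

Because the quadratic expansion is fixed, the mapping stays linear in its coefficients and can be fit in closed form, which is consistent with the abstract's claim that the transformation function can be estimated from a limited number of training utterances.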
