Transformation of prosody in voice conversion

Voice conversion (VC) aims to make one speaker's voice sound like that of another. So far, most voice conversion frameworks have focused only on spectral conversion. We note that speaker identity is also characterized by prosodic features such as fundamental frequency (F0), energy contour, and duration. Motivated by this, we propose a framework that converts F0, energy contour, and duration. In the traditional exemplar-based sparse representation approach to voice conversion, a general source-target dictionary of exemplars is constructed to establish the correspondence between the source and target speakers. In this work, we propose a phonetically aware sparse representation of F0 and energy contours using the continuous wavelet transform (CWT). Our idea is motivated by the fact that CWT decompositions of F0 and energy contours describe prosody patterns at different temporal scales and allow for effective prosody manipulation in speech synthesis. Furthermore, phonetically aware exemplars lead to better estimation of the activation matrix and therefore possibly to better prosody conversion. We also propose a phonetically aware duration conversion framework that takes into account both phone-level and sentence-level speaking rates. We report that the proposed prosody conversion outperforms traditional prosody conversion techniques in both objective and subjective evaluations.
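The exemplar-based idea described above can be illustrated with a minimal sketch. It is not the implementation from this work. It assumes a paired source-target dictionary whose columns are aligned exemplars, non-negative feature values, and a standard multiplicative update for a Euclidean cost with an L1 sparsity penalty. The function names, the `sparsity` parameter, and the toy data are hypothetical, and the phonetic awareness and CWT features of the proposed method are left out.

```python
import numpy as np


def estimate_activations(X, A, n_iter=200, sparsity=0.1, eps=1e-12):
    """Estimate non-negative activations H so that X is approximately A @ H.

    X: (n_features, n_frames) non-negative source features.
    A: (n_features, n_exemplars) non-negative source dictionary, kept fixed.
    Minimises 0.5 * ||X - A @ H||^2 + sparsity * sum(H) with the
    multiplicative update H <- H * (A^T X) / (A^T A H + sparsity).
    """
    rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
    H = rng.random((A.shape[1], X.shape[1])) + eps
    AtX = A.T @ X
    AtA = A.T @ A
    for _ in range(n_iter):
        H *= AtX / (AtA @ H + sparsity + eps)
    return H


def convert(X_src, A_src, B_tgt, **kwargs):
    """Apply activations estimated on the source side to the target dictionary.

    A_src and B_tgt are assumed to hold paired exemplars, column by column.
    """
    H = estimate_activations(X_src, A_src, **kwargs)
    return B_tgt @ H


if __name__ == "__main__":
    # Toy data only: random non-negative dictionaries, not real speech features.
    rng = np.random.default_rng(1)
    A_src = rng.random((6, 4))   # source exemplars
    B_tgt = rng.random((6, 4))   # paired target exemplars
    X_src = A_src @ rng.random((4, 10))
    Y_hat = convert(X_src, A_src, B_tgt, sparsity=0.0)
    print(Y_hat.shape)
```

In this sketch the source dictionary stays fixed and only the activations are updated, so the same activations can be reused with the target dictionary. A phonetically aware variant would, presumably, restrict each frame to the exemplars that share its phone label before estimating the activations.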
