On the analysis and evaluation of prosody conversion techniques

Voice conversion is the process of modifying the characteristics of a source speaker, such as spectrum and/or prosody, so that the speech sounds as if it were spoken by another speaker. In this paper, we study the evaluation of prosody transformation, in particular the evaluation of fundamental frequency (F0) conversion. F0 is an essential prosodic feature that must be handled in a comprehensive voice conversion framework. So far, converted prosody features have been evaluated mainly with the Pearson Correlation Coefficient and the Root Mean Square Error (RMSE). Unfortunately, these measures do not explicitly capture the F0 alignment between the source and target signals. We believe that an evaluation measure that takes the time alignment of F0 into account is needed to provide a new perspective. Therefore, in this paper, we study a new technique to assess the accuracy of prosody transformation. In our experiments with different prosody transformation techniques, the proposed evaluation approach produces results consistent with the baseline evaluation metrics.
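As an illustration of the two baseline metrics named above, the following minimal Python sketch computes the Pearson Correlation Coefficient and the RMSE frame by frame over two F0 contours of equal length. The function names, the toy contours, and the `dtw_cost` helper are assumptions introduced here for illustration. The dynamic time warping cost is only one possible alignment-aware measure; the abstract does not specify the proposed technique, so it should not be read as the authors' method.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length F0 contours."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length contours with at least 2 frames")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root mean square error of two equal-length F0 contours (same unit as input)."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty contours of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def dtw_cost(x, y):
    """Accumulated absolute F0 difference along the best DTW path.

    Illustrative assumption only: a plain dynamic time warping cost,
    not necessarily the measure proposed in the paper.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # Allowed steps: advance in x, advance in y, or advance in both.
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy F0 contours in Hz (hypothetical values, voiced frames only).
target = [100.0, 120.0, 140.0, 120.0, 100.0]
offset = [110.0, 130.0, 150.0, 130.0, 110.0]   # same shape, 10 Hz higher
shifted = [100.0, 100.0, 120.0, 140.0, 120.0]  # same shape, one frame late

print("offset : PCC=%.3f RMSE=%.3f" % (pearson(target, offset), rmse(target, offset)))
print("shifted: PCC=%.3f RMSE=%.3f DTW=%.1f"
      % (pearson(target, shifted), rmse(target, shifted), dtw_cost(target, shifted)))
```

In this toy case the constant 10 Hz offset leaves the correlation at 1 while the RMSE is 10 Hz, and the one-frame delay is penalized by both frame-wise metrics even though the contour shape is unchanged. Both frame-wise metrics also require contours of equal length, which is why the pairing of frames matters before they are applied.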
