Quality assessment of voice converted speech using articulatory features

We propose a novel application of acoustic-to-articulatory inversion (AAI) to the quality assessment of voice-converted speech. The ability of humans to speak effortlessly requires the coordinated movement of various articulators and muscles, and this coordination contributes to naturalness, intelligibility, and the speaker's identity (which is partially preserved in voice-converted speech). Hence, during voice conversion (VC), information related to speech production is lost. In this paper, this loss is quantified for a male voice by showing an increase in root mean squared error (RMSE) for voice-converted speech (up to 12.7 % for the tongue tip), followed by a decrease in mutual information (I) (by 8.7 %). Similar results are obtained for a female voice. This observation is extended by showing that articulatory features can be used as an objective quality measure. The effectiveness of the proposed measure over Mel Cepstral Distortion (MCD) is illustrated by comparing their correlations with the Mean Opinion Score (MOS). Moreover, the preference score of MCD contradicted the ABX test by 100 %, whereas the proposed measure agreed with the ABX test by 45.8 % and 16.7 % for female-to-male and male-to-female VC, respectively.
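
For readers who want a concrete picture of the metrics mentioned above, the sketch below gives minimal NumPy/scikit-learn versions of the three quantities involved: per-articulator RMSE between AAI-estimated trajectories, a kNN-based (Kraskov-style) mutual-information estimate, and the standard frame-averaged MCD. The function names, array shapes, and the use of scikit-learn's estimator are illustrative assumptions, not the paper's exact pipeline, which additionally requires AAI and time alignment of the natural and converted utterances.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression


def articulatory_rmse(traj_ref, traj_test):
    """Per-channel RMSE between two time-aligned articulatory trajectories.

    traj_ref, traj_test: (frames, channels) arrays, e.g. AAI-estimated tongue-tip
    positions for natural and voice-converted renditions of the same utterance.
    """
    return np.sqrt(np.mean((traj_ref - traj_test) ** 2, axis=0))


def acoustic_articulatory_mi(acoustic_feats, articulator_channel, k=3):
    """kNN-based MI estimate (Kraskov-style, via scikit-learn) between each
    acoustic feature dimension and one articulator channel.

    acoustic_feats: (frames, dims) array; articulator_channel: (frames,) array.
    Returns one MI value (in nats) per acoustic dimension.
    """
    return mutual_info_regression(acoustic_feats, articulator_channel, n_neighbors=k)


def mel_cepstral_distortion(mcep_ref, mcep_test):
    """Frame-averaged MCD in dB over time-aligned mel-cepstra, excluding the
    0th (energy) coefficient, following the standard definition."""
    diff = mcep_ref[:, 1:] - mcep_test[:, 1:]
    return np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```

A per-system correlation with listener scores can then be obtained with, e.g., `np.corrcoef(objective_scores, mos_scores)[0, 1]`, which is the usual way an objective measure is compared against MOS.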
