Automatic Speech Pronunciation Correction with Dynamic Frequency Warping-Based Spectral Conversion

This paper addresses pronunciation conversion (PC), the task of reducing non-native accents in speech while preserving the original speaker identity. Although PC can be regarded as a special class of voice conversion (VC), straightforwardly applying conventional VC methods to a PC task would not be successful, since VC may also change the original speaker identity of the input speech. This problem arises because two functions, namely an accent conversion function and a speaker similarity conversion function, are entangled in a single acoustic feature mapping function. This paper proposes dynamic frequency warping (DFW)-based spectral conversion to solve this problem. The proposed DFW-based PC converts the pronunciation of input speech by relocating its formants to the positions where native speakers tend to place theirs. We expect the speaker identity to be preserved because other factors, such as formant powers, are kept unchanged. [...] in a low frequency domain. Evaluation results confirmed that DFW-based PC with spectral residual modeling achieved higher speaker similarity to the original speaker than a conventional GMM-based VC method, while reducing foreign accents to a comparable degree.
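To make the idea of relocating formants while keeping their powers concrete, the following is a minimal sketch of frequency warping on a single toy spectral envelope. It is not the method evaluated in the paper: it assumes that the warping function can be found by a plain dynamic-programming alignment of two envelopes along the frequency axis, and the function names (`dfw_path`, `warp_envelope`) and the eight-bin envelopes are hypothetical illustrations.

```python
def dfw_path(src, tgt):
    """Align two spectral envelopes along the frequency axis.

    Uses a plain dynamic-programming (DTW-style) alignment with an
    absolute-difference local cost. Returns a list of (src_bin, tgt_bin)
    pairs from the lowest to the highest frequency bin.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    cost[0][0] = abs(src[0] - tgt[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best, arg = inf, None
            # Allowed predecessors: diagonal, previous src bin, previous tgt bin.
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, arg = cost[pi][pj], (pi, pj)
            cost[i][j] = best + abs(src[i] - tgt[j])
            back[i][j] = arg
    # Backtrack from the highest bin pair to the lowest.
    path, node = [], (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return path


def warp_envelope(src, path, n_out):
    """Resample `src` on the warped frequency axis.

    Each output bin copies the value of the source bin aligned to it, so
    peak heights (a stand-in for formant powers) come from the source and
    only the peak positions move.
    """
    warp = [0] * n_out
    for i, j in path:
        warp[j] = i  # keep the last source bin aligned to output bin j
    return [src[i] for i in warp]


# Toy envelopes (arbitrary units): the source peak sits at bin 2,
# the reference ("native") peak sits at bin 5 with a different height.
src = [0, 1, 4, 1, 0, 0, 0, 0]
tgt = [0, 0, 0, 0, 1, 3, 1, 0]

warped = warp_envelope(src, dfw_path(src, tgt), len(tgt))
print(warped)
```

In this toy case the warped envelope has its peak moved to the reference position while retaining the source peak height, which mirrors the intuition stated in the abstract. A practical system would add slope constraints on the warping path and operate on smoothed spectral envelopes frame by frame; those details are outside this sketch.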
