Novel method for data clustering and mode selection with application in voice conversion

Since the statistical properties of speech signals are variable and depend heavily on the content, it is hard to design speech processing techniques that perform well on all inputs. For example, in voice conversion, where the aim is to transform the speech uttered by a source speaker to sound as if it were spoken by a target speaker, different types of interspeaker relationships can be found in different types of speech segments. To tackle this problem in a robust manner, we have developed a novel scheme for data clustering and mode selection. When applied to voice conversion, the main idea of the proposed approach is to first cluster the target data so as to minimize intra-cluster variability. Then, a mode selector, i.e. a classifier, is trained on aligned source-related data to recognize the target-based clusters. In addition to the source data, auxiliary speech features can be used to improve the classification accuracy. Finally, a separate conversion scheme is trained and used for each cluster. The proposed scheme is fully data-driven and avoids the need for heuristic solutions. The superior performance of the proposed scheme has been verified in a practical voice conversion system.
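The three-stage pipeline described above (cluster the target data, train a mode selector on aligned source data, then fit one conversion model per cluster) can be sketched in a few lines of numpy. This is a minimal illustrative sketch, not the paper's implementation: k-means stands in for the unspecified clustering method, a nearest-class-mean classifier for the mode selector, and per-cluster linear regression for the conversion models; the toy aligned source/target frames and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical aligned source/target feature frames (n frames, d dims).
n, d, k = 300, 4, 3
src = rng.normal(size=(n, d))
# Toy target: a cluster-dependent (piecewise-linear) function of the source,
# mimicking different interspeaker relationships in different segment types.
tgt = np.where(src[:, :1] > 0, 2.0 * src + 1.0, -1.5 * src)

# Stage 1: cluster the TARGET frames to minimize intra-cluster variability
# (plain k-means sketch).
def kmeans(X, k, iters=50, seed=0):
    r = np.random.default_rng(seed)
    C = X[r.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(0)
    return labels

labels = kmeans(tgt, k)

# Stage 2: mode selector trained on SOURCE frames to recognize the
# target-based clusters (nearest class-mean classifier as a stand-in).
class_means = np.stack([src[labels == j].mean(0) for j in range(k)])

def select_mode(x):
    return int(np.argmin(((x - class_means) ** 2).sum(-1)))

# Stage 3: a separate conversion model per cluster
# (affine least-squares mapping from source to target frames).
W = []
for j in range(k):
    mask = labels == j
    Xj = np.hstack([src[mask], np.ones((mask.sum(), 1))])
    Wj, *_ = np.linalg.lstsq(Xj, tgt[mask], rcond=None)
    W.append(Wj)

def convert(x):
    """Pick a mode for frame x, then apply that cluster's conversion."""
    j = select_mode(x)
    return np.hstack([x, 1.0]) @ W[j]

converted = np.array([convert(x) for x in src])
err = float(np.mean((converted - tgt) ** 2))
```

At conversion time only the source frame is needed: the mode selector picks a cluster and the corresponding model is applied, which is what makes the scheme data-driven end to end.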
