Multimodal exemplar-based voice conversion using lip features in noisy environments

This paper presents a multimodal voice conversion (VC) method for noisy environments. In our previous exemplar-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The noisy input signal is then decomposed into source exemplars, noise exemplars obtained from the input signal itself, and their weights. The converted speech is constructed from the target exemplars and the weights associated with the source exemplars. In this paper, we propose a multimodal VC method that improves the noise robustness of our previous exemplar-based approach. As visual features, we use not only the conventional DCT but also features extracted by an Active Appearance Model (AAM) applied to the lip area of a face image. Furthermore, we introduce a combination weight between the audio and visual features and formulate a new cost function in order to estimate the audio-visual exemplars. By using joint audio-visual features as source features, the VC performance is improved compared to the previous audio-only exemplar-based VC method. The effectiveness of the proposed method was also confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method.

Index Terms: voice conversion, multimodal, image features, non-negative matrix factorization, noise robustness
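The decomposition described above (input spectra factored over a combined source-plus-noise exemplar dictionary, then reconstructed with parallel target exemplars) can be sketched with standard sparse NMF multiplicative updates. This is a minimal illustrative toy, not the paper's implementation: the dimensions, random dictionaries, and the single L1 sparsity weight `lam` are all assumptions, and the real method additionally stacks visual (DCT/AAM) features and a learned audio-visual combination weight into the dictionary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: D-dim spectral frames, K_s parallel
# source/target exemplars, K_n noise exemplars, T input frames.
D, K_s, K_n, T = 64, 100, 20, 50

A_source = np.abs(rng.standard_normal((D, K_s)))  # source exemplar dictionary
A_noise  = np.abs(rng.standard_normal((D, K_n)))  # noise exemplars from the input
A_target = np.abs(rng.standard_normal((D, K_s)))  # parallel target exemplars
X        = np.abs(rng.standard_normal((D, T)))    # noisy input magnitude spectra

A = np.hstack([A_source, A_noise])                # combined dictionary
H = np.abs(rng.standard_normal((A.shape[1], T)))  # activation weights to estimate

# Multiplicative updates minimizing KL divergence with an L1 penalty
# (lam) on the activations, the usual sparse-NMF recipe; the dictionary
# A stays fixed, only the weights H are updated.
lam, eps = 0.1, 1e-9
for _ in range(200):
    H *= (A.T @ (X / (A @ H + eps))) / (A.sum(axis=0)[:, None] + lam)

# Converted spectra: target exemplars weighted by the activations that
# belong to the *source* part of the dictionary (noise weights discarded).
Y = A_target @ H[:K_s]
print(Y.shape)  # (64, 50)
```

Because the source and target dictionaries are built from aligned parallel frames, reusing the source-side weights with the target exemplars transfers speaker identity while the noise exemplars absorb the noise component.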
