Visual-to-speech conversion based on maximum likelihood estimation

This paper proposes a visual-to-speech conversion method that converts voiceless lip movements into voiced utterances without recognizing text information. Inspired by a Gaussian Mixture Model (GMM)-based voice conversion method, a GMM is estimated from joint visual and audio features, and input visual features are converted to audio features using maximum likelihood estimation. Because the frame rate of the visual data is lower than that of the audio data, we construct long-term image features to capture lip movements. The proposed method has been evaluated on large-vocabulary continuous speech, and experimental results show that it effectively estimates the spectral envelopes and fundamental frequencies of audio speech from voiceless lip movements.
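The mapping described above can be pictured with a minimal Python sketch. This implements the simpler minimum mean-square error (MMSE) variant of GMM-based conversion rather than the paper's full maximum-likelihood trajectory estimation (which additionally smooths with dynamic features); the window width, mixture size, frame alignment, and all function names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def stack_frames(visual, width=4):
    """Long-term visual features: concatenate +/- `width` neighbouring
    frames around each video frame (window length is an assumption)."""
    T, _ = visual.shape
    padded = np.pad(visual, ((width, width), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * width + 1)])


def train_joint_gmm(visual, audio, n_components=32):
    """Fit a GMM on joint [visual; audio] vectors; the two streams are
    assumed time-aligned, e.g. by upsampling video to the audio rate."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(np.hstack([visual, audio]))
    return gmm


def convert(gmm, visual, dim_x):
    """MMSE mapping: for each visual vector x, output the posterior-
    weighted sum of conditional means E[y | x, m] over components m.
    (ML trajectory estimation with dynamic features is omitted.)"""
    dim_y = gmm.means_.shape[1] - dim_x
    out = np.zeros((visual.shape[0], dim_y))
    for t, x in enumerate(visual):
        # Posterior p(m | x) from the marginal visual GMM.
        lik = np.array([
            w * multivariate_normal.pdf(x, mu[:dim_x], S[:dim_x, :dim_x])
            for w, mu, S in zip(gmm.weights_, gmm.means_, gmm.covariances_)
        ])
        post = lik / lik.sum()
        # E[y | x, m] = mu_y + S_yx S_xx^{-1} (x - mu_x)
        for p, mu, S in zip(post, gmm.means_, gmm.covariances_):
            Sxx, Syx = S[:dim_x, :dim_x], S[dim_x:, :dim_x]
            out[t] += p * (mu[dim_x:] + Syx @ np.linalg.solve(Sxx, x - mu[:dim_x]))
    return out
```

In this sketch `dim_x` is the dimensionality of the stacked visual features, and the converted audio features (e.g. spectral envelope and fundamental frequency parameters) would then be passed to a vocoder for waveform synthesis.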
