An HMM-based speech-to-video synthesizer

Emerging broadband communication systems promise a future of multimedia telephony, e.g., the addition of visual information to telephone conversations. It is therefore useful to consider how to generate the visual information critical for speechreading from the existing narrowband channels used for speech. This paper focuses on the problem of synthesizing visual articulatory movements given the acoustic speech signal: the acoustic signal is analyzed and the corresponding articulatory movements are synthesized for speechreading. This paper describes a hidden Markov model (HMM)-based visual speech synthesizer. The key elements in applying HMMs to this problem are the decomposition of the overall modeling task into key stages and the judicious choice of the observation vector's components for each stage. The main contribution of this paper is a novel correlation HMM that integrates independently trained acoustic and visual HMMs for speech-to-visual synthesis. This model allows increased flexibility in choosing the model topologies of the acoustic and visual HMMs, and it requires less training data than early-integration modeling techniques. Results from objective experiments show that the proposed approach reduces time-alignment errors by 37.4% compared to a conventional temporal scaling method. Furthermore, subjective results indicate that the proposed model can increase speech understanding.
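
The general pipeline described above can be illustrated with a minimal sketch. The code below is not the paper's correlation HMM; it is a simplified stand-in, assuming an acoustic HMM whose Viterbi-decoded state sequence indexes per-state visual parameter means taken from a paired visual model. All names, shapes, and the state-to-visual mapping are hypothetical.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely HMM state path given per-frame log emission scores.

    log_pi: (S,)   initial-state log probabilities
    log_A:  (S, S) transition log probabilities
    log_B:  (T, S) log-likelihood of each audio frame under each state
    """
    T, S = log_B.shape
    delta = np.empty((T, S))          # best partial-path score per state
    psi = np.zeros((T, S), dtype=int) # best predecessor per state
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # (S, S): i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):    # backtrack through predecessors
        path[t] = psi[t + 1, path[t + 1]]
    return path

def audio_to_visual(log_pi, log_A, log_B, visual_means):
    """Map a decoded acoustic state sequence to visual parameters.

    visual_means: (S, D) hypothetical mean articulatory vector per state,
    standing in for the visual HMM that the paper's correlation model
    pairs with the independently trained acoustic HMM.
    """
    states = viterbi(log_pi, log_A, log_B)
    return visual_means[states]       # (T, D) frame-synchronous trajectory
```

In practice the per-frame scores `log_B` would come from the acoustic model's output distributions evaluated on spectral features, and a smoothing or parameter-generation step would typically follow to avoid discontinuities in the articulatory trajectory at state boundaries.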
