Which Phoneme-to-Viseme Maps Best Improve Visual-Only Computer Lip-Reading?

A critical assumption of all current visual speech recognition systems is that there are visual speech units, called visemes, which can be mapped to the units of acoustic speech, the phonemes. Although a number of maps have been published, their effectiveness is rarely tested, particularly for visual-only lip-reading (many studies use audio-visual speech). Here we examine 120 mappings and consider whether any are stable across talkers. We present a method for devising maps from the phoneme confusions of an automated lip-reading system, and we introduce new mappings that improve performance for individual talkers.
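
As a rough illustration of the confusion-driven mapping idea, the Python sketch below greedily groups phonemes whose mutual visual confusion exceeds a threshold, so that heavily confused phonemes collapse into one viseme class. The confusion matrix, the threshold value, and the greedy grouping rule are all illustrative assumptions, not the procedure used in the paper.

import numpy as np

def confusion_to_visemes(conf, phonemes, threshold=0.25):
    """Greedily group phonemes into viseme classes.

    conf[i, j] is the count of times phoneme i was recognised as
    phoneme j by a visual-only recogniser. Two phonemes land in the
    same viseme when their symmetrised confusion rate exceeds
    `threshold`. Illustrative sketch only, not the paper's method.
    """
    # Row-normalise so each row is a probability distribution.
    p = conf / conf.sum(axis=1, keepdims=True)
    # Symmetrise: confusion in either direction counts.
    sym = (p + p.T) / 2.0

    visemes = []  # each viseme is a list of phoneme indices
    for i in range(len(phonemes)):
        for group in visemes:
            # Join a group only if confused with every member of it.
            if all(sym[i, j] >= threshold for j in group):
                group.append(i)
                break
        else:
            visemes.append([i])  # start a new viseme class
    return [[phonemes[i] for i in group] for group in visemes]

if __name__ == "__main__":
    # Toy bilabial example: /p/, /b/, /m/ are visually near-identical,
    # /f/ is distinct, so we expect {p, b, m} and {f}.
    phonemes = ["p", "b", "m", "f"]
    conf = np.array([
        [10, 8, 7, 1],
        [8, 10, 7, 1],
        [7, 7, 10, 1],
        [1, 1, 1, 10],
    ], dtype=float)
    print(confusion_to_visemes(conf, phonemes))
    # -> [['p', 'b', 'm'], ['f']]

In practice the grouping rule and threshold would themselves be tuned per talker, which is consistent with the paper's finding that mappings derived from a talker's own confusions improve that talker's recognition.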
