Rationale for Phoneme-Viseme Mapping and Feature Selection in Visual Speech Recognition

We describe a methodology to automatically identify visemes and to determine important oral-cavity features for a speaker-dependent, optical continuous speech recognizer. A viseme, as defined by Fisher (1968), represents phones that contain optically similar sequences of oral-cavity movements. Large-vocabulary, continuous acoustic speech recognizers that use Hidden Markov Models (HMMs) require accurate phone models (Lee 1989); similarly, an optical recognizer requires accurate viseme models (Goldschen 1993). Since no universally accepted viseme definition exists, we provide an empirical one using HMMs: we train a set of phone HMMs on optical information and then cluster similar phone HMMs to form viseme HMMs. We compare our algorithmic phone-to-viseme mapping with the mappings of human speechreading experts. We begin, however, by describing the oral-cavity feature selection process, which determines the features that characterize the movements of the oral cavity during speech. This process uses a correlation matrix, principal component analysis, and speechreading heuristics to reduce the number of oral-cavity features from 35 to 13. Our analysis concludes that the dynamic oral-cavity features offer considerable potential for machine speechreading and for the teaching of human speechreading.
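To make the feature-selection step concrete, the sketch below shows one way the correlation-matrix and principal-component stages could be combined. It is a minimal illustration, not the paper's exact procedure: the function name select_features, the correlation threshold, and the loading-based ranking are assumptions, and the speechreading heuristics mentioned above are not modeled here.

    import numpy as np

    def select_features(X, names, corr_thresh=0.9, n_keep=13):
        # X: frames x features matrix of oral-cavity measurements;
        # names: one label per column. Both are hypothetical inputs.
        R = np.corrcoef(X, rowvar=False)        # feature-feature correlations
        keep = list(range(X.shape[1]))
        for i in range(len(R)):
            for j in range(i + 1, len(R)):
                # Drop the later member of any highly correlated pair.
                if i in keep and j in keep and abs(R[i, j]) > corr_thresh:
                    keep.remove(j)
        # PCA on the surviving features via the covariance eigendecomposition.
        C = np.cov(X[:, keep], rowvar=False)
        vals, vecs = np.linalg.eigh(C)          # eigenvalues in ascending order
        order = np.argsort(vals)[::-1]          # reorder by descending variance
        weights = vals[order] / vals.sum()
        # Rank each feature by its variance-weighted absolute PC loadings.
        scores = (np.abs(vecs[:, order]) * weights).sum(axis=1)
        ranked = sorted(zip(scores, keep), reverse=True)[:n_keep]
        return [names[j] for _, j in ranked]

Called with 35 measured features, this returns the 13 highest-ranked labels, mirroring the 35-to-13 reduction described above.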
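The phone-to-viseme mapping can likewise be sketched as agglomerative clustering over pairwise phone-HMM distances. The distance measure itself is assumed to be precomputed (the paper compares trained phone HMMs; any model-similarity measure could stand in for the matrix D below), and cluster_phones, dist_thresh, and the average-linkage rule are illustrative choices.

    import numpy as np

    def cluster_phones(D, phones, dist_thresh):
        # D: symmetric matrix of pairwise phone-HMM distances (assumed given);
        # phones: phone labels. Phones whose models lie closer than
        # dist_thresh end up in the same viseme class.
        clusters = [[i] for i in range(len(phones))]

        def linkage(a, b):
            # Average pairwise distance between two clusters' members.
            return np.mean([D[i, j] for i in a for j in b])

        while len(clusters) > 1:
            # Find the closest pair of clusters.
            pairs = [(linkage(a, b), ai, bi)
                     for ai, a in enumerate(clusters)
                     for bi, b in enumerate(clusters) if ai < bi]
            d, ai, bi = min(pairs)
            if d > dist_thresh:
                break            # remaining clusters are distinct visemes
            clusters[ai] = clusters[ai] + clusters[bi]
            del clusters[bi]
        return [[phones[i] for i in c] for c in clusters]

Each returned cluster corresponds to one viseme HMM class; the resulting grouping is what would then be compared against the experts' phone-to-viseme mappings.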