Paper information - Feature based representation for audio-visual speech recognition

Feature based representation for audio-visual speech recognition

In this paper, we consider the interaction of acoustic and visual stimuli at the subphonemic level of the distinctive feature. We argue that this provides a natural intermediate level for audio-visual integration and discuss the visual and acoustic feature detection problems that are associated with this task.

1. RECOGNITION USING AUDIO-VISUAL CUES

While compelling psychophysical evidence exists [9] suggesting that the auditory and visual modalities are strongly linked in word and syllable recognition, it is generally unclear how to embody such audio-visual links in a coherent and insightful computational framework. In speech recognition systems that are based on HMMs, the core recognition engine (an HMM) is constrained to receive inputs that are a sequence of vectors in a finite-dimensional space. As a result, most approaches using such systems end up simply concatenating (at each point in time) the video frame and audio frame into one audio-visual frame and performing recognition with such combined inputs. Alternatively, one uses essentially two HMM systems, one based entirely on video inputs and another entirely on audio inputs, and combines likelihood scores from the two in a late integration strategy.

In this paper, we consider the possibility of an intermediate level at which the interaction of visual and acoustic cues might occur. We argue that the distinctive feature provides a reasonable intermediate level at which the cues can be integrated prior to lexical access, and we describe the visual and auditory components of such a feature-based strategy for audio-visual recognition.

1.1. Distinctive Features

An alternative framework for speech recognition, one that utilizes the notion of distinctive features [3], has been pursued in Niyogi et al. [6, 5].
Such a perspective has its roots in phonological theory, which suggests that phonemes are not the atomic units of which syllables, words and other linguistic objects are composed but are in fact themselves decomposable into primitives called distinctive features. Each distinctive feature is conceptually a binary-valued variable; Table 1 shows some phonemes and their distinctive features. The distinctive features have typically been viewed as phonological oppositions that separate minimal pairs of confusable phonemes in a language. Thus the minimal pairs /p, b/, /t, d/, etc. are separated by the feature voice, with /p/ and /t/ being unvoiced /-voice/ and /b/ and /d/ being voiced /+voice/. The distinctive features may also be viewed as defining a natural phonological class, e.g., the labial sounds /p, b, m/ form a class with feature value /+labial/.

Feature       p   b   t   s   m   n   u
Consonantal   +   +   +   +   +   +
Labial        +   +           +
Alveolar              +   +       +
Nasal                         +   +
Voicing           +           +   +   +
Continuant                +           +

Table 1: Phonemes and their associated distinctive feature values.

[Figure labels: Auditory Cues, Visual Cues, DISTINCTIVE FEATURE]
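To make the two readings of distinctive features concrete (oppositions that separate minimal pairs, and definitions of natural classes), Table 1 can be sketched as a small lookup structure. This is only an illustrative sketch, not the authors' implementation. The placement of the "+" values is an assumption: it follows the labial class /p, b, m/ and the voicing values given in the text, plus standard phonological classification for the remaining features. Unmarked entries are treated as "-". The names `PLUS`, `feature_vector`, `differing_features`, `natural_class` and `minimal_pairs` are hypothetical.

```python
from itertools import combinations

# Phonemes in the column order of Table 1.
PHONEMES = ["p", "b", "t", "s", "m", "n", "u"]

# ASSUMPTION: for each feature, the phonemes taken to carry the "+" value.
# Any phoneme not listed for a feature is treated as "-".
PLUS = {
    "consonantal": {"p", "b", "t", "s", "m", "n"},
    "labial": {"p", "b", "m"},
    "alveolar": {"t", "s", "n"},
    "nasal": {"m", "n"},
    "voicing": {"b", "m", "n", "u"},
    "continuant": {"s", "u"},
}


def feature_vector(phoneme):
    """Return a dict mapping each feature to True ("+") or False ("-")."""
    if phoneme not in PHONEMES:
        raise KeyError(f"phoneme {phoneme!r} is not in the table")
    return {feature: phoneme in plus for feature, plus in PLUS.items()}


def differing_features(a, b):
    """Return the sorted list of features on which two phonemes disagree."""
    va, vb = feature_vector(a), feature_vector(b)
    return sorted(f for f in va if va[f] != vb[f])


def natural_class(feature, value=True):
    """Return the phonemes that share the given value of one feature."""
    return [p for p in PHONEMES if (p in PLUS[feature]) == value]


def minimal_pairs():
    """Return the phoneme pairs that are separated by exactly one feature."""
    return [
        (a, b)
        for a, b in combinations(PHONEMES, 2)
        if len(differing_features(a, b)) == 1
    ]


if __name__ == "__main__":
    print(differing_features("p", "b"))
    print(natural_class("labial"))
    print(minimal_pairs())
```

Under these assumed values, /p/ and /b/ differ only in voicing, which matches the minimal-pair example in the text, and the /+labial/ class is /p, b, m/.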

[1] Keith Waters, et al. Computer facial animation, 1996.

[2] Partha Niyogi, et al. Distinctive feature detection using support vector machines, 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[3] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[4] Partha Niyogi, et al. A detection framework for locating phonetic events, 1998, ICSLP.

[5] Jialin Zhong, et al. Flexible face animation using MPEG-4/SNHC parameter streams, 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[6] D. Thomson, et al. Spectrum estimation and harmonic analysis, 1982, Proceedings of the IEEE.

[7] David G. Stork, et al. Speechreading by Humans and Machines, 1996.

[8] Michael M. Cohen, et al. Modeling Coarticulation in Synthetic Visual Speech, 1993.

[9] Michael Kenstowicz, et al. Phonology In Generative Grammar, 1994.

[10] Partha Niyogi, et al. Incorporating voice onset time to improve letter recognition accuracies, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).