Audio-visual scene analysis: evidence for a "very-early" integration process in audio-visual speech perception

Recent experiments suggest that audio-visual interaction in speech perception could begin at a very early level, at which the visual input improves the detection of speech sounds embedded in noise [1]. We show here that this "speech detection" benefit may translate into a "speech identification" benefit distinct from lipreading per se. The experimental trick consists in using a series of lip gestures compatible with a number of different audio configurations, e.g. [y u ty tu ky ku dy du gy gu] in French. We show that visual identification of this corpus is at chance level but that, when the visual input is added to the sound embedded in a large amount of cocktail-party noise, vision improves the identification of one phonetic feature, namely plosive voicing. We discuss this result in terms of audio-visual scene analysis.

1. AUDIO-VISUAL SPEECH FUSION: LATE, EARLY OR ... VERY EARLY?

The literature on audiovisual (AV) fusion in speech perception is largely organized around the question of the fusion level: late or early, depending on whether fusion follows or precedes phonetic identification. While late-integration models account for a large body of experimental evidence and provide the basis for most developments in the framework of AV speech recognition, a number of experimental data appear incompatible with late integration. Let us mention the AV-VOT problem [2], AV interaction in the processing of rate and voicing [3], and even difficulties with the McGurk effect [4,5].

Late- and early-integration models share a common assumption of independence of the primitive monosensorial processing: information would first be extracted separately in each sensory channel before fusion. However, a number of recent studies have raised serious doubts about this assumption. The first study, by Grant and Seitz [1], showed that visible movements of the speech articulators improved the detection of speech embedded in acoustical white noise, with a gain of about 2 dB. Further experiments [6,7] confirmed this result and showed that the correlation between the energy in the F2-F3 region and the variation of inter-lip separation was the main determinant of the detection improvement (see also [8]; a schematic sketch of such a correlation measure is given below). The extraction of auditory cues thanks to visual movements can be understood as a kind of "very early" fusion process, which would occur prior to the fusion/identification stages considered by early- or late-integration models; we shall come back to this question later.

Does this process add anything to the intelligibility of speech in noise? The role of lipreading in understanding noisy speech is quite well known (since [9]), but the question here is different.
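To make the audio-visual coherence measure discussed above more concrete, the sketch below shows one possible way of correlating the acoustic energy in the F2-F3 region with an inter-lip separation signal. It is only an illustration under assumed sampling rates, band limits and signal names, not the procedure actually used in [6,7].

```python
# Minimal sketch (assumed parameters, not the analysis of [6,7]):
# correlate the F2-F3 band energy envelope with a lip-aperture signal.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, resample

AUDIO_SR = 16000          # audio sampling rate (Hz), assumed
VIDEO_SR = 50             # lip-tracking rate (Hz), assumed
F2F3_BAND = (1000, 3000)  # rough F2-F3 frequency band (Hz), assumed

def band_energy_envelope(audio, sr=AUDIO_SR, band=F2F3_BAND, out_sr=VIDEO_SR):
    """Band-pass the audio in the F2-F3 region, take its amplitude
    envelope, and resample it to the rate of the lip signal."""
    b, a = butter(4, [band[0] / (sr / 2), band[1] / (sr / 2)], btype="band")
    filtered = filtfilt(b, a, audio)
    envelope = np.abs(hilbert(filtered))
    n_out = int(len(audio) * out_sr / sr)
    return resample(envelope, n_out)

def av_correlation(audio, lip_aperture):
    """Pearson correlation between the F2-F3 energy envelope and the
    inter-lip separation, truncated to a common length."""
    env = band_energy_envelope(audio)
    n = min(len(env), len(lip_aperture))
    return np.corrcoef(env[:n], lip_aperture[:n])[0, 1]

if __name__ == "__main__":
    # Toy example with synthetic signals standing in for a real utterance.
    t_a = np.arange(0, 2.0, 1 / AUDIO_SR)
    t_v = np.arange(0, 2.0, 1 / VIDEO_SR)
    modulation = 0.5 * (1 + np.sin(2 * np.pi * 4 * t_a))   # ~4 Hz syllabic rhythm
    audio = modulation * np.sin(2 * np.pi * 2000 * t_a)    # energy placed in the F2-F3 band
    lips = 0.5 * (1 + np.sin(2 * np.pi * 4 * t_v))         # lip aperture opening in phase
    print(f"AV correlation: {av_correlation(audio, lips):.2f}")
```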