Models for audiovisual fusion in a noisy-vowel recognition task

This paper compares four basic architectures for audiovisual speech fusion in a noisy-vowel recognition task. When provided with contextual input (the signal-to-noise ratio), three of the four architectures satisfy the "synergy" criterion: audiovisual (AV) recognition outperforms both audio-alone (A) and visual-alone (V) recognition, both globally and for each individual phonetic feature. Without contextual input, performance collapses; for one model, however, we propose an original approach based on efficient non-linear data processing that yields simpler algorithms and improves the performance of the audiovisual fusion operator.
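
To make the "synergy" criterion concrete, it can be read as requiring score(AV) ≥ max(score(A), score(V)) for the global recognition rate and for every individual phonetic feature. The following minimal Python sketch illustrates such a check; the function name, feature labels, and scores are hypothetical placeholders, not the paper's actual models or data:

```python
# Sketch of the "synergy" criterion described above.
# Feature names and scores are hypothetical, for illustration only.

def satisfies_synergy(scores_a, scores_v, scores_av):
    """Return True if audiovisual (AV) recognition beats both
    audio-alone (A) and visual-alone (V) recognition for every
    phonetic feature (and hence also in global terms)."""
    return all(
        scores_av[feature] >= max(scores_a[feature], scores_v[feature])
        for feature in scores_av
    )

# Hypothetical per-feature recognition rates at some SNR.
a  = {"height": 0.72, "rounding": 0.65}
v  = {"height": 0.55, "rounding": 0.80}
av = {"height": 0.85, "rounding": 0.88}

print(satisfies_synergy(a, v, av))  # True: AV >= max(A, V) per feature
```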