Lip-Reading Aids Word Recognition Most in Moderate Noise: A Bayesian Explanation Using High-Dimensional Feature Space

Watching a speaker's facial movements can dramatically enhance our ability to comprehend words, especially in noisy environments. From a general doctrine of combining information from different sensory modalities (the principle of inverse effectiveness), one would expect visual signals to be most effective at the highest levels of auditory noise. In contrast, we find, in accord with a recent paper, that visual information improves performance more at intermediate levels of auditory noise than at the highest levels, and we show that a novel visual stimulus containing only temporal information does the same. We present a Bayesian model of optimal cue integration that can explain this conflict. In this model, words are regarded as points in a multidimensional feature space and word recognition is a probabilistic inference process. When the dimensionality of the feature space is low, the Bayesian model predicts inverse effectiveness; when the dimensionality is high, the enhancement is maximal at intermediate auditory noise levels. When the auditory and visual stimuli differ slightly in high noise, the model makes a counterintuitive prediction: as sound quality increases, the proportion of reported words corresponding to the visual stimulus should first increase and then decrease. We confirm this prediction in a behavioral experiment. We conclude that auditory-visual speech perception obeys the same notion of optimality previously observed only for simple multisensory stimuli.
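To make the model concrete, below is a minimal simulation of this kind of Bayesian word recognizer. It is an illustrative sketch, not the paper's fitted model: the Gaussian lexicon, the isotropic noise, and all parameter values are assumptions. Each word is a point in a D-dimensional feature space; the auditory and visual observations are the word corrupted by independent Gaussian noise; and the listener reports the word maximizing the posterior p(word | x_a, x_v), which is proportional to p(x_a | word) p(x_v | word) p(word). With equal priors and isotropic noise, this reduces to choosing the lexicon entry nearest the precision-weighted average of the two observations.

import numpy as np

rng = np.random.default_rng(0)

def recognition_rate(D, sigma_a, sigma_v=None, n_words=1000, n_trials=500):
    # Hypothetical lexicon: n_words points drawn i.i.d. from a standard
    # normal in a D-dimensional feature space (an assumption of this sketch).
    lexicon = rng.standard_normal((n_words, D))
    correct = 0
    for _ in range(n_trials):
        target = rng.integers(n_words)
        word = lexicon[target]
        # Auditory observation: the target word plus isotropic Gaussian noise.
        x_a = word + sigma_a * rng.standard_normal(D)
        if sigma_v is None:
            x_hat = x_a                      # auditory-only condition
        else:
            # Visual observation, then optimal (precision-weighted) fusion.
            x_v = word + sigma_v * rng.standard_normal(D)
            w_a = sigma_v**2 / (sigma_a**2 + sigma_v**2)
            x_hat = w_a * x_a + (1.0 - w_a) * x_v
        # Equal priors plus isotropic noise: the MAP word is simply the
        # lexicon entry nearest to the fused observation.
        guess = int(np.argmin(((lexicon - x_hat) ** 2).sum(axis=1)))
        correct += guess == target
    return correct / n_trials

# Audiovisual gain = accuracy with both cues minus auditory-only accuracy,
# at low vs. high dimensionality and several auditory noise levels
# (all values here are illustrative, not the paper's).
for D in (2, 30):
    for sigma_a in (0.5, 2.0, 8.0):
        gain = (recognition_rate(D, sigma_a, sigma_v=2.0)
                - recognition_rate(D, sigma_a))
        print(f"D={D:2d}  sigma_a={sigma_a:4.1f}  AV gain = {gain:+.3f}")

Sweeping the dimensionality D and the auditory noise level sigma_a in a script like this shows where the audiovisual gain peaks under a given parameter choice. In the paper's account, low-dimensional feature spaces yield inverse effectiveness (largest gain at the highest noise), while high-dimensional spaces push the peak gain toward intermediate noise levels.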
