Cue Integration in Categorical Tasks: Insights from Audio-Visual Speech Perception

Previous cue integration studies have examined continuous perceptual dimensions (e.g., size) and have shown that human cue integration is well described by a normative model in which cues are weighted in proportion to their sensory reliability, as estimated from single-cue performance. However, this normative model may not apply to categorical perceptual dimensions (e.g., phonemes). In tasks defined over categorical perceptual dimensions, optimal cue weights should depend not only on the sensory variance affecting the perception of each cue but also on the environmental variance inherent in each task-relevant category. Here, we present a computational and experimental investigation of cue integration in a categorical audio-visual (articulatory) speech perception task. Our results show that human performance during audio-visual phonemic labeling is qualitatively consistent with the behavior of a Bayes-optimal observer. Specifically, we show that participants in our task are sensitive, on a trial-by-trial basis, to the sensory uncertainty associated with the auditory and visual cues during phonemic categorization. In addition, we show that while sensory uncertainty is a significant factor in determining cue weights, it is not the only one: participants' performance is consistent with an optimal model in which environmental, within-category variability also plays a role in determining cue weights. Furthermore, we show that in our task, the sensory variability affecting the visual modality during cue combination is not well estimated from single-cue performance, but can be estimated from multi-cue performance. The findings and computational principles described here represent a principled first step towards characterizing the mechanisms underlying human cue integration in categorical tasks.
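The contrast the abstract draws can be made concrete with a minimal sketch. In the standard normative model for continuous dimensions, each cue's weight is proportional to its reliability (inverse sensory variance); in the categorical extension described here, the effective variance of each cue additionally includes the environmental (within-category) variance along that cue's dimension. The function names and the additive-variance form below are illustrative assumptions consistent with the abstract's description, not the paper's exact generative model.

```python
def cue_weights_continuous(sigma_a, sigma_v):
    """Standard reliability-weighted cue combination (continuous dimensions).

    sigma_a, sigma_v: sensory noise standard deviations for the
    auditory and visual cues. Weights are proportional to inverse
    variance and sum to 1.
    """
    r_a, r_v = 1.0 / sigma_a**2, 1.0 / sigma_v**2
    w_a = r_a / (r_a + r_v)
    return w_a, 1.0 - w_a


def cue_weights_categorical(sigma_a, sigma_v, cat_sigma_a, cat_sigma_v):
    """Categorical extension: effective variance per cue combines
    sensory noise with the within-category (environmental) variance
    along that cue's dimension, so a cue whose category is broad is
    down-weighted even when its sensory noise is low.
    """
    var_a = sigma_a**2 + cat_sigma_a**2
    var_v = sigma_v**2 + cat_sigma_v**2
    r_a, r_v = 1.0 / var_a, 1.0 / var_v
    w_a = r_a / (r_a + r_v)
    return w_a, 1.0 - w_a
```

With zero category variance the two models agree; making the visual category broader (larger `cat_sigma_v`) shifts weight toward the auditory cue, which is the qualitative signature the experiments probe.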
