Psychophysics of the McGurk and other audiovisual speech integration effects.

When the auditory and visual components of spoken audiovisual nonsense syllables are mismatched, perceivers produce four types of perceptual response: auditory correct, visual correct, fusion (the so-called McGurk effect), and combination (i.e., two consonants are reported). Here, quantitative measures were developed to account for the distribution of the four response types across 384 different stimuli from four talkers. The measures included mutual information, correlations, and acoustic measures, all representing audiovisual stimulus relationships. In Experiment 1, open-set perceptual responses were obtained for acoustic /bɑ/ or /lɑ/ dubbed to video /bɑ, dɑ, gɑ, vɑ, zɑ, lɑ, wɑ, ðɑ/. The talker, the video syllable, and the acoustic syllable each significantly influenced the type of response. In Experiment 2, the best predictors of response category proportions were a subset of the physical stimulus measures, which accounted for between 17% and 52% of the variance in the perceptual response category proportions. That audiovisual stimulus relationships can account for perceptual response distributions supports the possibility that internal representations are based on modality-specific stimulus relationships.
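As a rough illustration of the two kinds of analysis the abstract describes, the sketch below estimates mutual information between an acoustic and an optical feature stream and fits a linear regression predicting a response-category proportion from stimulus-relationship measures. This is a minimal sketch under stated assumptions, not the authors' analysis: the feature streams, predictor set, and data here are hypothetical stand-ins.

```python
# Minimal sketch (not the authors' code) of (a) a mutual-information measure
# of audiovisual stimulus relationship and (b) a linear regression predicting
# perceptual response proportions from such measures. All data are synthetic.

import numpy as np

def mutual_information(x, y, bins=8):
    """Estimate MI (in bits) between two feature streams via histogram binning."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y
    nz = pxy > 0                              # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)

# Hypothetical per-stimulus predictors, one row per audiovisual stimulus:
# e.g., acoustic-optical MI, an acoustic-optical correlation, and a purely
# acoustic measure. The target is the proportion of fusion ("McGurk")
# responses for that stimulus (toy values here).
n_stimuli = 384
X = rng.normal(size=(n_stimuli, 3))
beta_true = np.array([0.4, -0.2, 0.1])
y = X @ beta_true + rng.normal(scale=0.5, size=n_stimuli)

# Ordinary least squares with an intercept; R^2 corresponds to the
# "variance accounted for" statistic reported for Experiment 2.
Xd = np.column_stack([np.ones(n_stimuli), X])
beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
resid = y - Xd @ beta
r2 = 1.0 - resid.var() / y.var()
print(f"R^2 = {r2:.2f}")
```

In this framing, each stimulus contributes one row of physical-relationship measures and one observed response proportion, and the regression asks how much of the variation in perceptual outcomes those stimulus relationships can explain.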
