Auditory Perceptual Learning for Speech Perception Can Be Enhanced by Audiovisual Training

Speech perception under audiovisual (AV) conditions is well known to confer benefits such as increased speed and accuracy. Here, we investigated how AV training might benefit or impede auditory perceptual learning of speech degraded by vocoding. In Experiments 1 and 3, participants learned paired associations between vocoded spoken nonsense words and nonsense pictures. In Experiment 1, paired-associates (PA) AV training of one group of participants was compared with audio-only (AO) training of another group. When tested under AO conditions, the AV-trained group was significantly more accurate than the AO-trained group. In addition, pre- and post-training AO forced-choice consonant identification with untrained nonsense words showed that AV-trained participants had learned significantly more than AO-trained participants. The pattern of results pointed to learning at the level of the auditory phonetic features of the vocoded stimuli. Experiment 2, a no-training control with testing and re-testing on the AO consonant-identification task, showed that the controls were as accurate as the AO-trained participants in Experiment 1 but less accurate than the AV-trained participants. In Experiment 3, PA training alternated AV and AO conditions on a list-by-list basis within participants, and training continued to criterion (92% correct). PA training with AO stimuli was reliably more effective than training with AV stimuli. We explain these discrepant results in terms of the so-called “reverse hierarchy theory” of perceptual learning and the diverse multisensory and unisensory processing resources available to speech perception. We propose that early AV speech integration can impede auditory perceptual learning, whereas visual top-down access to relevant auditory features can promote it.
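
As context for the degradation used here, noise vocoding divides speech into frequency bands, discards the spectral fine structure within each band, and uses the band amplitude envelopes to modulate noise. The sketch below is a minimal, generic noise vocoder in Python, assuming NumPy and SciPy; the channel count, frequency range, and filter order are illustrative assumptions, not the parameters used in these experiments.

```python
# Minimal noise-vocoder sketch (illustrative only; channel count, band
# edges, and filter order are assumptions, not the paper's parameters).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(signal, fs, n_channels=8, f_lo=100.0, f_hi=7000.0):
    """Replace the spectral fine structure of `signal` with noise,
    preserving only the per-band amplitude envelopes."""
    # Log-spaced band edges spanning f_lo..f_hi (f_hi must be < fs / 2).
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
    noise = np.random.randn(len(signal))
    out = np.zeros(len(signal))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, signal)              # analysis band
        envelope = np.abs(hilbert(band))             # amplitude envelope
        carrier = sosfiltfilt(sos, noise)            # band-limited noise
        out += sosfiltfilt(sos, envelope * carrier)  # confine product to band
    # Match the original RMS level.
    return out * np.sqrt(np.mean(signal ** 2) / np.mean(out ** 2))
```

Vocoding studies often low-pass filter the envelopes before modulation; the Hilbert envelope above is a common simplification of that step.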
