Simulation of talking faces in the human brain improves auditory speech recognition

Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier with visual input, because visual cues such as mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for that person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was present only in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face.
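A minimal sketch, not taken from the paper, of what such a speaker-specific internal model could compute. Here the model is cast as the learned dynamics of a Kalman filter, one concrete reading of the predictive-coding account: during the brief audiovisual exposure the listener acquires a model of how this speaker's articulatory features evolve, and during auditory-only listening that model simulates the upcoming input and is corrected by the noisy acoustic evidence. Every quantity below (the feature dimensionality, the dynamics matrix A_speaker, the noise covariances) is a hypothetical stand-in; the point is only that a filter equipped with the matched, speaker-specific model recovers the underlying feature trajectory from noisy audio better than one using a generic model.

import numpy as np

rng = np.random.default_rng(42)

d = 4  # toy auditory feature dimensionality (illustrative assumption)
# Speaker-specific dynamics: a stable rotation-like matrix standing in for
# the internal model learned during the 2-min audiovisual exposure.
A_speaker = 0.9 * np.linalg.qr(rng.normal(size=(d, d)))[0]
Q = 0.05 * np.eye(d)  # process noise covariance (how variable the speaker is)
R = np.eye(d)         # large observation noise covariance (noisy acoustics)

# Generate one "utterance": latent features x and the noisy audio y we hear.
T = 300
x = np.zeros((T, d))
y = np.zeros((T, d))
for t in range(1, T):
    x[t] = A_speaker @ x[t - 1] + rng.multivariate_normal(np.zeros(d), Q)
    y[t] = x[t] + rng.multivariate_normal(np.zeros(d), R)

def kalman_filter(obs, A, Q, R):
    # Standard Kalman filter; A plays the role of the internal model that
    # simulates the next input before the acoustic evidence arrives.
    n, k = obs.shape
    x_hat = np.zeros((n, k))
    P = np.eye(k)
    for t in range(1, n):
        x_pred = A @ x_hat[t - 1]                  # prediction (simulation) step
        P_pred = A @ P @ A.T + Q
        K = P_pred @ np.linalg.inv(P_pred + R)     # Kalman gain
        x_hat[t] = x_pred + K @ (obs[t] - x_pred)  # correct with prediction error
        P = (np.eye(k) - K) @ P_pred
    return x_hat

def mse(est):
    return float(np.mean((est - x) ** 2))

print("raw noisy audio        :", round(mse(y), 3))
print("generic model (A = I)  :", round(mse(kalman_filter(y, np.eye(d), Q, R)), 3))
print("speaker-specific model :", round(mse(kalman_filter(y, A_speaker, Q, R)), 3))

On a typical run the speaker-specific filter yields the lowest reconstruction error, the generic filter an intermediate one, and the raw observations the highest, mirroring the claim that a better (speaker-matched) internal model leaves smaller prediction errors and hence supports more robust auditory-only recognition.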
