The Effects of Audiovisual Inputs on Solving the Cocktail Party Problem in the Human Brain: An fMRI Study

Abstract

At cocktail parties, our brains often receive visual and auditory information simultaneously. Although the cocktail party problem has been widely investigated under auditory-only settings, the effects of audiovisual inputs have received far less attention. This study explored those effects in a simulated cocktail party. In our fMRI experiment, each congruent audiovisual stimulus was a composite of two facial movie clips, each of which belonged to one of two emotion categories (crying and laughing). Visual-only (faces) and auditory-only (voices) stimuli were created by extracting the visual and auditory contents from the synthesized audiovisual stimuli. Subjects were instructed to selectively attend to one of the two objects in each stimulus and to judge its emotion category under the visual-only, auditory-only, and audiovisual conditions. The neural representations of the emotion features were assessed by computing decoding accuracy and a brain pattern reproducibility index from the fMRI data. Comparing the audiovisual condition with the visual-only and auditory-only conditions, we found that audiovisual inputs enhanced the neural representations of the emotion features of the attended objects but not of the unattended objects. This enhancement may partly explain how audiovisual inputs help the brain solve the cocktail party problem.
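The two measures named above can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' exact pipeline: it assumes trial-wise multivoxel patterns (e.g., per-trial beta estimates) as features and the attended emotion category as labels, uses a cross-validated linear SVM for decoding accuracy, and takes the reproducibility index to be a split-half correlation of the mean category pattern (an assumed definition for illustration). The data are synthetic and all sizes are hypothetical.

```python
# Minimal sketch (not the authors' exact pipeline) of the two measures
# named in the abstract, using synthetic data in place of real fMRI patterns.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels = 80, 200                    # hypothetical sizes
X = rng.standard_normal((n_trials, n_voxels))   # trial-by-voxel patterns
y = rng.integers(0, 2, size=n_trials)           # 0 = crying, 1 = laughing

# (1) Decoding accuracy: cross-validated linear classification of the
# attended emotion category from multivoxel patterns.
clf = make_pipeline(StandardScaler(), LinearSVC())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc = cross_val_score(clf, X, y, cv=cv).mean()
print(f"decoding accuracy: {acc:.2f}")

# (2) Reproducibility index, taken here as the split-half correlation of the
# mean category pattern (an assumed definition, for illustration only).
half1, half2 = X[::2], X[1::2]
lab1, lab2 = y[::2], y[1::2]
pat1 = half1[lab1 == 0].mean(axis=0)   # mean "crying" pattern, half 1
pat2 = half2[lab2 == 0].mean(axis=0)   # mean "crying" pattern, half 2
reproducibility = np.corrcoef(pat1, pat2)[0, 1]
print(f"split-half pattern reproducibility: {reproducibility:.2f}")
```

On synthetic noise both measures sit at baseline (accuracy near chance, correlation near zero); the abstract's finding corresponds to both being higher for attended objects in the audiovisual condition than in the unimodal conditions.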
