A Gestalt inference model for auditory scene segregation

Our current understanding of how the brain segregates auditory scenes into meaningful objects is in line with a Gestalt framework. Gestalt principles suggest a theory of how different attributes of the soundscape are extracted and then bound together into separate groups that reflect the different objects or streams present in the scene. These grouping cues are thought to reflect the underlying statistical structure of natural sounds, much as the statistics of natural images are closely linked to the principles that guide figure-ground segregation and object segmentation in vision. In the present study, we leverage inference in stochastic neural networks to learn emergent grouping cues directly from natural soundscapes, including speech, music, and sounds in nature. The model learns a hierarchy of local and global spectro-temporal attributes reminiscent of the simultaneous and sequential Gestalt cues that underlie the organization of auditory scenes. These mappings operate at multiple time scales to analyze an incoming complex scene and are then fused using a Hebbian network that binds coherent features together into perceptually segregated auditory objects. The proposed architecture successfully emulates a wide range of well-established auditory scene segregation phenomena and quantifies the complementary role of segregation and binding cues in driving auditory scene segregation.
