Ecological origins of perceptual grouping principles in the auditory system

Significance

Events and objects must be inferred from sensory signals. Because sensory measurements are temporally and spatially local, the estimation of an object or event can be viewed as the grouping of these measurements into representations of their common causes. Perceptual grouping is believed to reflect internalized regularities of the natural world, yet grouping cues have traditionally been identified using informal observation. Here, we derive auditory grouping cues by measuring and summarizing statistics of natural sound features. Feature co-occurrence statistics reproduced established cues but also revealed previously unappreciated grouping principles. The results suggest that auditory grouping is adapted to natural stimulus statistics, show how these statistics can reveal previously unappreciated grouping phenomena, and provide a framework for studying grouping in natural signals.

Abstract

Events and objects in the world must be inferred from sensory signals to support behavior. Because sensory measurements are temporally and spatially local, the estimation of an object or event can be viewed as the grouping of these measurements into representations of their common causes. Perceptual grouping is believed to reflect internalized regularities of the natural environment, yet grouping cues have traditionally been identified using informal observation and investigated using artificial stimuli. The relationship of grouping to natural signal statistics has thus remained unclear, and additional or alternative cues remain possible. Here, we develop a general methodology for relating grouping to natural sensory signals and apply it to derive auditory grouping cues from natural sounds. We first learned local spectrotemporal features from natural sounds and measured their co-occurrence statistics. We then learned a small set of stimulus properties that could predict the measured feature co-occurrences. The resulting cues included established grouping cues, such as harmonic frequency relationships and temporal coincidence, but also revealed previously unappreciated grouping principles. Human perceptual grouping was predicted by natural feature co-occurrence, with humans relying on the derived grouping cues in proportion to their informativity about co-occurrence in natural sounds. The results suggest that auditory grouping is adapted to natural stimulus statistics, show how these statistics can reveal previously unappreciated grouping phenomena, and provide a framework for studying grouping in natural signals.
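The core idea of the method — that features generated by a common source co-occur in natural sounds, and that these co-occurrence statistics recover grouping cues such as harmonicity and common onset — can be illustrated with a toy simulation. The sketch below is illustrative only, not the paper's pipeline: it replaces learned spectrotemporal features with simple frequency-channel activations in a synthetic corpus, where channels belonging to the harmonics of a common fundamental switch on and off together, and then measures co-occurrence as the correlation of channel activations over time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "natural sound" corpus: two sources, each exciting the harmonics of
# its own fundamental. Channels driven by the same source share on/off
# gating, mimicking the harmonicity and common-onset regularities that
# feature co-occurrence statistics are meant to capture.
n_channels = 40          # frequency channels (channel k ~ frequency k+1)
n_frames = 5000
f0_a, f0_b = 3, 5        # fundamentals of the two sources (as channel numbers)

activ = rng.normal(0, 0.05, (n_channels, n_frames))    # background noise
for f0 in (f0_a, f0_b):
    gate = (rng.random(n_frames) < 0.3).astype(float)  # shared on/off pattern
    for h in range(f0, n_channels + 1, f0):            # all harmonics of f0
        activ[h - 1] += gate * rng.uniform(0.5, 1.0)

# Co-occurrence summarized as the correlation of activations across time.
cooc = np.corrcoef(activ)

# Harmonically related channels of one source co-occur strongly;
# the fundamentals of the two independent sources do not.
same_source = cooc[f0_a - 1, 2 * f0_a - 1]  # f0_a and its 2nd harmonic
diff_source = cooc[f0_a - 1, f0_b - 1]      # fundamentals of different sources
print(f"co-occurrence, harmonic pair:  {same_source:.2f}")
print(f"co-occurrence, unrelated pair: {diff_source:.2f}")
```

In this toy setting the harmonic pair yields a correlation near 1 while the unrelated pair stays near 0, so a summary of the co-occurrence matrix would single out harmonic frequency relationships as a grouping cue — the logic, in miniature, of deriving cues from natural feature statistics.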
