Auditory scene analysis as Bayesian inference in sound source models

Inferring individual sound sources from the mixture of sound waves that enters our ears is a central problem in auditory perception, termed auditory scene analysis (ASA). The study of ASA has uncovered a diverse set of illusions that suggest general principles underlying perceptual organization. However, most explanations for these illusions remain intuitive or narrowly focused, lacking formal models that predict perceived sound sources from the acoustic waveform, and it is unclear whether ASA phenomena can be explained by a small set of principles. We present a Bayesian model built on representations of simple acoustic sources, in which a neural network guides Markov chain Monte Carlo inference. Given a sound waveform, our system infers the number of sources present, the parameters defining each source, and the sound produced by each source. This model qualitatively accounts for perceptual judgments on a variety of classic ASA illusions and can, in some cases, infer perceptually valid sources from simple audio recordings.

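As a concrete illustration of the analysis-by-synthesis inference described above, the following is a minimal sketch: Metropolis-Hastings sampling over the parameters of simple tone sources, scoring each hypothesis by how well its rendered mixture matches the observed waveform. All function names and parameter values here are hypothetical, the number of sources is fixed rather than inferred, and the neural network that guides proposals in the paper is omitted; this is a toy version of the idea, not the authors' implementation.

```python
# Hypothetical sketch: Metropolis-Hastings over pure-tone source parameters.
# Each hypothesis is rendered to a waveform and scored against the observed
# mixture under a Gaussian likelihood. All names are illustrative inventions.
import numpy as np

SR = 8000          # sample rate (Hz)
DUR = 0.5          # duration (s)
t = np.arange(int(SR * DUR)) / SR

def render_tone(freq, amp):
    """Render one pure-tone source hypothesis as a waveform."""
    return amp * np.sin(2 * np.pi * freq * t)

def log_likelihood(params, observed, noise_sd=0.05):
    """Gaussian log-likelihood of the observed mixture given summed sources."""
    mixture = sum(render_tone(f, a) for f, a in params)
    return -0.5 * np.sum((observed - mixture) ** 2) / noise_sd ** 2

def metropolis_hastings(observed, n_sources=2, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize each source's parameters (frequency in Hz, amplitude).
    params = [(rng.uniform(100, 1000), rng.uniform(0.1, 1.0))
              for _ in range(n_sources)]
    ll = log_likelihood(params, observed)
    for _ in range(n_iter):
        # Propose a local perturbation to one randomly chosen source.
        i = rng.integers(n_sources)
        f, a = params[i]
        prop = params.copy()
        prop[i] = (f + rng.normal(0, 5.0), max(a + rng.normal(0, 0.05), 0.0))
        ll_prop = log_likelihood(prop, observed)
        # Accept or reject by the Metropolis rule (symmetric proposal).
        if np.log(rng.uniform()) < ll_prop - ll:
            params, ll = prop, ll_prop
    return params

if __name__ == "__main__":
    # Synthetic "scene": a mixture of two tones plus noise.
    rng = np.random.default_rng(1)
    truth = [(440.0, 0.7), (620.0, 0.4)]
    observed = sum(render_tone(f, a) for f, a in truth)
    observed = observed + rng.normal(0, 0.05, observed.shape)
    print(metropolis_hastings(observed))
```

In the full model, proposals over richer source parameterizations (and over the number of sources itself) would be guided by a learned network rather than the blind Gaussian perturbations used here.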