A Computational Model for Combinatorial Generalization in Physical Perception from Sound

Humans possess the unique ability of combinatorial generalization in auditory perception: given novel auditory stimuli, they perform auditory scene analysis and infer the underlying causal physical interactions from prior knowledge. Can we build a computational model that achieves the same kind of combinatorial generalization? In this paper, we present a case study on box-shaking: having heard only the sound of a single ball moving in a box, we seek to interpret the sound of two or three balls of different materials. To solve this task, we propose a hybrid model with two components: a neural network for perception and a physical audio engine for simulation. We use the output of the network as an initial guess and then run Markov chain Monte Carlo (MCMC) sampling with the audio engine to refine it. By combining neural networks with a physical audio engine, our hybrid model achieves combinatorial generalization efficiently and accurately in auditory scene perception.
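The refinement step described in the abstract (a network's estimate used as the starting point for MCMC sampling against an audio engine) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: `toy_audio_engine` is a hypothetical stand-in that renders each ball as one damped sinusoid, the latent state is a list of `(frequency, decay)` pairs, the likelihood is a Gaussian on the waveform error, and the sampler is a plain random-walk Metropolis-Hastings. The "network output" is simply a hand-perturbed copy of the true parameters.

```python
import math
import random


def toy_audio_engine(params, n_samples=200, rate=8000.0):
    """Hypothetical stand-in for a physical audio engine.

    Each (frequency, decay) pair contributes one damped sinusoid.
    """
    wave = [0.0] * n_samples
    for freq, decay in params:
        for i in range(n_samples):
            t = i / rate
            wave[i] += math.exp(-decay * t) * math.sin(2 * math.pi * freq * t)
    return wave


def log_likelihood(params, observed, sigma=0.1):
    """Gaussian log-likelihood (up to a constant) of the observed waveform."""
    synth = toy_audio_engine(params, len(observed))
    err = sum((s - o) ** 2 for s, o in zip(synth, observed))
    return -err / (2 * sigma ** 2)


def refine_with_mh(initial_guess, observed, n_steps=300,
                   step=(5.0, 2.0), seed=0):
    """Random-walk Metropolis-Hastings started from an initial guess.

    Returns the highest-likelihood state visited and its log-likelihood.
    """
    rng = random.Random(seed)
    current = list(initial_guess)
    current_ll = log_likelihood(current, observed)
    best, best_ll = current, current_ll
    for _ in range(n_steps):
        # Symmetric Gaussian proposal on every (frequency, decay) pair.
        proposal = [(f + rng.gauss(0, step[0]), d + rng.gauss(0, step[1]))
                    for f, d in current]
        # Flat prior on non-negative decay: reject anything outside it.
        if any(d < 0 for _, d in proposal):
            continue
        proposal_ll = log_likelihood(proposal, observed)
        accept_prob = math.exp(min(0.0, proposal_ll - current_ll))
        if rng.random() < accept_prob:
            current, current_ll = proposal, proposal_ll
            if current_ll > best_ll:
                best, best_ll = current, current_ll
    return best, best_ll


if __name__ == "__main__":
    # Two "balls" with made-up parameters; the observation is synthetic.
    true_params = [(440.0, 30.0), (660.0, 50.0)]
    observed = toy_audio_engine(true_params)
    # Stand-in for the perception network's output: a slightly wrong guess.
    network_guess = [(450.0, 25.0), (650.0, 55.0)]
    refined, refined_ll = refine_with_mh(network_guess, observed)
    print("initial log-likelihood:", log_likelihood(network_guess, observed))
    print("refined log-likelihood:", refined_ll)
    print("refined parameters:", refined)
```

Starting the chain at the network's estimate rather than at a random point is what would make the sampling efficient in a scheme like this: the chain only has to explore the neighborhood of an already plausible explanation of the sound.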
