The MirrorNet: Learning Audio Synthesizer Controls Inspired by Sensorimotor Interaction

Experiments probing the sensorimotor neural interactions in the human cortical speech system support the existence of a bidirectional flow of interactions between the auditory and motor regions. A key function of these interactions is to enable the brain to ‘learn’ how to control the vocal tract for speech production. This idea is the impetus for the recently proposed “MirrorNet”, a constrained autoencoder architecture. In this paper, the MirrorNet is applied to learn, in an unsupervised manner, the controls of a specific audio synthesizer (DIVA) so as to produce melodies only from their auditory spectrograms. The results demonstrate how the MirrorNet discovers synthesizer parameters that generate melodies closely resembling the originals, generalizes to unseen melodies, and even finds the best set of parameters to approximate renditions of complex piano melodies generated by a different synthesizer. This generalizability of the MirrorNet illustrates its potential to discover, from sensory data alone, the controls of arbitrary motor plants such as autonomous vehicles.
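For intuition, the sketch below shows one plausible reading of the constrained-autoencoder idea in PyTorch: an encoder maps an auditory spectrogram to synthesizer controls, a learned forward model stands in for the (non-differentiable) synthesizer, and both are trained so that the loop reproduces the input spectrogram while the forward model stays faithful to the real synthesizer's output. This is a minimal illustration, not the authors' implementation; all module names, layer sizes, the spectrogram shape, and the number of DIVA controls (here 16) are assumptions.

```python
# Minimal sketch of a MirrorNet-style constrained autoencoder (illustrative only).
import torch
import torch.nn as nn

N_MEL, N_FRAMES, N_PARAMS = 128, 100, 16  # assumed spectrogram shape and control count


class Encoder(nn.Module):
    """Maps an auditory spectrogram to a vector of synthesizer controls."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(N_MEL * N_FRAMES, 256), nn.ReLU(),
            nn.Linear(256, N_PARAMS), nn.Sigmoid(),  # controls normalized to [0, 1]
        )

    def forward(self, spec):
        return self.net(spec)


class ForwardModel(nn.Module):
    """Differentiable stand-in for the synthesizer: controls -> spectrogram."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_PARAMS, 256), nn.ReLU(),
            nn.Linear(256, N_MEL * N_FRAMES),
        )

    def forward(self, params):
        return self.net(params).view(-1, N_MEL, N_FRAMES)


encoder, forward_model = Encoder(), ForwardModel()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(forward_model.parameters()), lr=1e-3
)


def training_step(spec, synth_spec_fn):
    """One unsupervised step. `synth_spec_fn` is a hypothetical hook that renders the
    true (non-differentiable) synthesizer output for the predicted controls."""
    params = encoder(spec)
    recon = forward_model(params)
    with torch.no_grad():
        synth_spec = synth_spec_fn(params)  # what the real synthesizer would produce
    # Reproduce the input spectrogram through the loop, and keep the forward model
    # consistent with the real synthesizer on the predicted controls.
    loss = (nn.functional.mse_loss(recon, spec)
            + nn.functional.mse_loss(recon, synth_spec))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Toy usage with random data and a stand-in "synthesizer" so the sketch runs end to end.
dummy_spec = torch.rand(4, N_MEL, N_FRAMES)
print(training_step(dummy_spec, synth_spec_fn=lambda p: forward_model(p).detach()))
```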
