AudioViewer: Learning to Visualize Sound

Sensory substitution can help persons with perceptual deficits. In this work, we visualize audio as video. Our long-term goal is to create sound perception for hearing-impaired people, for instance, to provide visual feedback when training deaf speech. Unlike existing models that translate between speech and text or between text and images, we target an immediate, low-level translation that applies to generic environmental sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design translates from audio to video by compressing both into a common latent space with shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model preserves important audio features in the generated video, and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features, since humans can easily parse them to match and distinguish sounds, words, and speakers.
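To make the shared-latent-space idea concrete, the following is a minimal sketch of the kind of pipeline the abstract describes: an audio encoder compresses a spectrogram window into a latent code, which an image decoder renders as a face or digit frame. All module names, dimensions, and architectural details here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (PyTorch): audio -> shared latent code -> video frame.
# Names and sizes are assumptions for demonstration only.
import torch
import torch.nn as nn

LATENT_DIM = 64  # assumed size of the shared latent space


class AudioEncoder(nn.Module):
    """Map a mel-spectrogram window to a latent Gaussian (VAE-style)."""

    def __init__(self, n_mels=80, frames=20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * frames, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.mu = nn.Linear(256, LATENT_DIM)
        self.logvar = nn.Linear(256, LATENT_DIM)

    def forward(self, spec):
        h = self.backbone(spec)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample a latent code from N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar


class ImageDecoder(nn.Module):
    """Render a latent code as a small grayscale image (e.g. a face or digit)."""

    def __init__(self, out_hw=64):
        super().__init__()
        self.out_hw = out_hw
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, out_hw * out_hw), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, self.out_hw, self.out_hw)


if __name__ == "__main__":
    enc, dec = AudioEncoder(), ImageDecoder()
    spec = torch.randn(8, 80, 20)   # batch of mel-spectrogram windows
    z, mu, logvar = enc(spec)       # shared latent code per window
    frames = dec(z)                 # one video frame per audio window
    print(frames.shape)             # torch.Size([8, 1, 64, 64])
```

In the paper's framing, the structure of this latent space (and hence the decoded video) would additionally be shaped by perceptual priors, disentanglement, and unpaired-translation objectives; those losses are omitted from this sketch.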
