Conditional Generation of Audio from Video via Foley Analogies

The sound effects that designers add to videos are intended to convey a particular artistic effect and thus may differ substantially from a scene's true sound. Inspired by the challenge of creating a soundtrack for a video that differs from its true sound, yet still matches the actions occurring on screen, we propose the problem of conditional Foley. We make the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input video clip, using a conditional audio-visual clip sampled from another time within the same source video. Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like". We show through human studies and automated evaluation metrics that our model successfully generates sound from video, while varying its output according to the content of a supplied example. Project site: https://xypb.github.io/CondFoleyGen/
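To make the pretext task concrete, here is a minimal Python sketch of the sampling scheme it describes: drawing a conditioning clip and a target clip from different times within the same source video, so the conditioning clip supplies audio and video while the target clip's video is the input and its audio the prediction target. This is our own illustration under assumed clip lengths and names (e.g., `sample_conditional_foley_pair`, `clip_len_s`), not the paper's code.

```python
import random

def sample_conditional_foley_pair(num_frames, fps, clip_len_s=2.0, min_gap_s=0.5):
    """Sample a (conditional, target) clip pair from one video.

    The conditional clip supplies both audio and video; the target clip's
    video frames are the model input, and its audio is the prediction
    target. Both clips come from the same source video but from
    non-overlapping times.
    """
    clip_len = int(clip_len_s * fps)
    gap = int(min_gap_s * fps)
    assert num_frames >= 2 * clip_len + gap, "video too short for two clips"

    # Sample two non-overlapping windows, earlier one first.
    a = random.randrange(0, num_frames - 2 * clip_len - gap + 1)
    b = random.randrange(a + clip_len + gap, num_frames - clip_len + 1)

    # Randomly choose which window conditions and which is predicted,
    # so the model cannot exploit a fixed temporal ordering.
    cond_start, tgt_start = (a, b) if random.random() < 0.5 else (b, a)
    cond = (cond_start, cond_start + clip_len)  # supplies audio + video
    tgt = (tgt_start, tgt_start + clip_len)     # video input, audio target
    return cond, tgt
```

Because both clips share a scene (and typically the same objects and materials), the model must learn to transfer the conditioning clip's sound character onto the target clip's actions, which is exactly the behavior needed at test time with a user-supplied example.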
