Learning Audio-Visual Correlations From Variational Cross-Modal Generation

People can easily imagine the potential sound of an event while watching it unfold. This natural synchronization between audio and visual signals reveals their intrinsic correlations. Motivated by this, we propose to learn audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner; the learned correlations can then be readily applied to multiple downstream tasks such as audio-visual cross-modal localization and retrieval. We introduce a novel Variational AutoEncoder (VAE) framework that consists of Multiple encoders and a Shared decoder (MS-VAE), together with an additional Wasserstein distance constraint, to tackle the problem. Extensive experiments demonstrate that the optimized latent representation of the proposed MS-VAE effectively captures audio-visual correlations and can be readily applied to multiple audio-visual downstream tasks, achieving competitive performance even without any label information during training.
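To make the architecture described above concrete, the following is a minimal numpy sketch of an MS-VAE-style forward pass: one encoder per modality producing a diagonal-Gaussian posterior, a single decoder shared by both modalities, and a closed-form squared 2-Wasserstein distance between the two posteriors as the cross-modal constraint. All dimensions, function names, and the linear layers are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, z_dim):
    """Return a function mapping features -> (mu, logvar)."""
    w_mu = rng.normal(0.0, 0.1, (in_dim, z_dim))
    w_lv = rng.normal(0.0, 0.1, (in_dim, z_dim))
    return lambda x: (x @ w_mu, x @ w_lv)

def make_decoder(z_dim, out_dim):
    """Shared decoder: a single linear map from latent to feature space."""
    w = rng.normal(0.0, 0.1, (z_dim, out_dim))
    return lambda z: z @ w

def reparameterize(mu, logvar):
    """Standard VAE reparameterization trick."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def w2_diag_gauss(mu_a, lv_a, mu_v, lv_v):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    sig_a, sig_v = np.exp(0.5 * lv_a), np.exp(0.5 * lv_v)
    return ((mu_a - mu_v) ** 2).sum(-1) + ((sig_a - sig_v) ** 2).sum(-1)

feat_dim, z_dim, batch = 128, 32, 4      # illustrative sizes
enc_audio = make_encoder(feat_dim, z_dim)
enc_visual = make_encoder(feat_dim, z_dim)
decoder = make_decoder(z_dim, feat_dim)  # shared across modalities

audio = rng.normal(size=(batch, feat_dim))
visual = rng.normal(size=(batch, feat_dim))

mu_a, lv_a = enc_audio(audio)
mu_v, lv_v = enc_visual(visual)
z_a = reparameterize(mu_a, lv_a)
z_v = reparameterize(mu_v, lv_v)

# Cross-modal generation: both latents go through the same decoder,
# so audio latents can reconstruct visual features and vice versa.
recon_from_audio = decoder(z_a)
recon_from_visual = decoder(z_v)

# Wasserstein term: pulls the paired audio and visual posteriors together.
w2 = w2_diag_gauss(mu_a, lv_a, mu_v, lv_v)
```

Because the decoder is shared, the training signal forces both modality-specific posteriors into a common latent space, and the per-pair Wasserstein term above is what would be minimized alongside the usual reconstruction and KL losses.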
