Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
[1] Tae-Hyun Oh, et al. Prefix Tuning for Automated Audio Captioning, 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] Junsik Kim, et al. Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding, 2023, 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
[3] Joon Son Chung, et al. MarginNCE: Robust Sound Localization with a Negative Margin, 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[4] D. Erhan, et al. Phenaki: Variable Length Video Generation From Open Domain Textual Description, 2022, ICLR.
[5] David J. Fleet, et al. Imagen Video: High Definition Video Generation with Diffusion Models, 2022, ArXiv.
[6] Ben Poole, et al. DreamFusion: Text-to-3D using 2D Diffusion, 2022, ICLR.
[7] Yaniv Taigman, et al. Make-A-Video: Text-to-Video Generation without Text-Video Data, 2022, ICLR.
[8] K. M. Yi, et al. Estimating Visual Information From Audio Through Manifold Learning, 2022, ArXiv.
[9] Tae-Hyun Oh, et al. CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes, 2022, ECCV.
[10] Andrew Owens, et al. Learning Visual Styles from Audio-Visual Associations, 2022, ECCV.
[11] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.
[12] Prafulla Dhariwal, et al. Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022, ArXiv.
[13] Junsik Kim, et al. Less Can Be More: Sound Source Localization With a Classification Model, 2022, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
[14] Wonmin Byeon, et al. Sound-Guided Semantic Image Manipulation, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[15] J. Bello, et al. Wav2CLIP: Learning Robust Audio Representations from CLIP, 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[16] Andrew Owens, et al. Strumming to the Beat: Audio-Conditioned Contrastive Video Textures, 2021, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
[17] Ron Mokady, et al. ClipCap: CLIP Prefix for Image Captioning, 2021, ArXiv.
[18] Esa Rahtu, et al. Taming Visually Guided Sound Generation, 2021, BMVC.
[19] Yong Jae Lee, et al. Collaging Class-specific GANs for Semantic Image Synthesis, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[20] Michal Drozdzal, et al. Instance-Conditioned GAN, 2021, NeurIPS.
[21] Chang Zhou, et al. CogView: Mastering Text-to-Image Generation via Transformers, 2021, NeurIPS.
[22] Shih-Fu Chang, et al. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, 2021, NeurIPS.
[23] Yongqin Xian, et al. Distilling Audio-Visual Knowledge by Compositional Contrastive Learning, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Andrea Vedaldi, et al. Localizing Visual Sounds the Hard Way, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Daniel Cohen-Or, et al. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[26] Nuno Vasconcelos, et al. Robust Audio-Visual Instance Discrimination, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[28] Alec Radford, et al. Zero-Shot Text-to-Image Generation, 2021, ICML.
[29] Daniel Cohen-Or, et al. Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Jianping Gou, et al. Knowledge Distillation: A Survey, 2020, International Journal of Computer Vision.
[31] N. Vasconcelos, et al. Audio-Visual Instance Discrimination with Cross-Modal Agreement, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Tae-Hyun Oh, et al. Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[33] Climent Nadeu, et al. Sound-to-Imagination: Unsupervised Crossmodal Translation Using Deep Dense Network Architecture, 2021, ArXiv.
[34] Trevor Darrell, et al. Benchmark for Compositional Text-to-Image Synthesis, 2021, NeurIPS Datasets and Benchmarks.
[35] T. Winterbottom, et al. On Modality Bias in the TVQA Dataset, 2020, BMVC.
[36] Weiyao Lin, et al. Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching, 2020, NeurIPS.
[37] Andrew Owens, et al. Self-Supervised Learning of Audio-Visual Objects from Video, 2020, ECCV.
[38] Anoop Cherian, et al. Sound2Sight: Generating Visual Dynamics from Sound and Context, 2020, ECCV.
[39] Chuang Gan, et al. Generating Visually Aligned Sound From Videos, 2020, IEEE Transactions on Image Processing.
[40] Kun Su, et al. Audeo: Audio Generation for a Silent Performance Video, 2020, NeurIPS.
[41] Julien Mairal, et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, 2020, NeurIPS.
[42] Andrew Zisserman, et al. VGGSound: A Large-Scale Audio-Visual Dataset, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[43] Bernard Ghanem, et al. Self-Supervised Learning by Cross-Modal Audio-Video Clustering, 2019, NeurIPS.
[44] Chuang Gan, et al. Self-Supervised Moving Vehicle Tracking With Stereo Sound, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[45] Jitendra Malik, et al. Learning Individual Styles of Conversational Gesture, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[46] Tae-Hyun Oh, et al. Speech2Face: Learning the Face Behind a Voice, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Kristen Grauman, et al. Co-Separating Sounds of Visual Objects, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[48] Peter Wonka, et al. Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[49] Bo Zhao, et al. Image Generation From Layout, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Jeff Donahue, et al. Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018, ICLR.
[51] Shun-Po Chuang, et al. Towards Audio to Scene Image Synthesis Using Generative Adversarial Network, 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[52] Xuelong Li, et al. Deep Multimodal Clustering for Unsupervised Audiovisual Learning, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Oriol Vinyals, et al. Representation Learning with Contrastive Predictive Coding, 2018, ArXiv.
[54] Andrew Owens, et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features, 2018, ECCV.
[55] Li Fei-Fei, et al. Image Generation from Scene Graphs, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[56] Tae-Hyun Oh, et al. Learning to Localize Sound Source in Visual Scenes, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[57] Andrew Owens, et al. Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning, 2017, International Journal of Computer Vision.
[58] Andrew Zisserman, et al. Objects that Sound, 2017, ECCV.
[59] Chen Fang, et al. Visual to Sound: Generating Natural Sound for Videos in the Wild, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[60] Zhaoxiang Zhang, et al. CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation, 2017, AAAI.
[61] Oriol Vinyals, et al. Neural Discrete Representation Learning, 2017, NIPS.
[62] Sepp Hochreiter, et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2017, NIPS.
[63] Andrew Zisserman, et al. Look, Listen and Learn, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[64] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, ArXiv.
[65] Chenliang Xu, et al. Deep Cross-Modal Audio-Visual Generation, 2017, ACM Multimedia.
[66] Antonio Torralba, et al. SoundNet: Learning Sound Representations from Unlabeled Video, 2016, NIPS.
[67] Andrew Owens, et al. Ambient Sound Provides Supervision for Visual Learning, 2016, ECCV.
[68] Wojciech Zaremba, et al. Improved Techniques for Training GANs, 2016, NIPS.
[69] Andrew Owens, et al. Visually Indicated Sounds, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[70] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[71] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[72] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.
[73] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.
[74] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[75] Geoffrey E. Hinton, et al. Visualizing Data using t-SNE, 2008.