WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Qiuqiang Kong | Wenwu Wang | Yuexian Zou | Haohe Liu | Xinhao Mei | Mark D. Plumbley | Tom Ko | Chengqi Zhao | Chutong Meng
[1] Xuenan Xu,et al. BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data , 2023, ACM Multimedia.
[2] Dongchao Yang,et al. Improving Text-Audio Retrieval by Text-Aware Attention Pooling and Prior Matrix Revised Loss , 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[3] Wenwu Wang,et al. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models , 2023, ICML.
[4] Timo I. Denk,et al. MusicLM: Generating Music From Text , 2023, ArXiv.
[5] Yusong Wu,et al. Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation , 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[6] Chao Weng,et al. Diffsound: Discrete Diffusion Model for Text-to-Sound Generation , 2022, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[7] Benjamin Elizalde,et al. CLAP: Learning Audio Concepts From Natural Language Supervision , 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[8] Daniel C. Tompkins,et al. BEATs: Audio Pre-Training with Acoustic Tokenizers , 2022, ICML.
[9] J. Kittler,et al. ASiT: Audio Spectrogram vIsion Transformer for General Audio Representation , 2022, ArXiv.
[10] Dongchao Yang,et al. FeatureCut: An Adaptive Data Augmentation for Automated Audio Captioning , 2022, 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).
[11] Eungbeom Kim,et al. Improving Audio-Language Learning with MixGen and Multi-Level Test-Time Augmentation , 2022, ArXiv.
[12] Yu Zhang,et al. Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention , 2022, ArXiv.
[13] Yaniv Taigman,et al. AudioGen: Textually Guided Audio Generation , 2022, ICLR.
[14] Benjamin Elizalde,et al. Audio Retrieval with WavText5K and CLAP Training , 2022, INTERSPEECH 2023.
[15] Wenwu Wang,et al. Automated audio captioning: an overview of recent progress and new challenges , 2022, EURASIP Journal on Audio, Speech, and Music Processing.
[16] Avi Gazneli,et al. End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network , 2022, ArXiv.
[17] T. Virtanen,et al. Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering , 2022, 2022 30th European Signal Processing Conference (EUSIPCO).
[18] Prafulla Dhariwal,et al. Hierarchical Text-Conditional Image Generation with CLIP Latents , 2022, ArXiv.
[19] Wenwu Wang,et al. On Metric Learning for Audio-Text Cross-Modal Retrieval , 2022, INTERSPEECH.
[20] Chng Eng Siong,et al. Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning , 2022, ArXiv.
[21] Qiuqiang Kong,et al. Separate What You Describe: Language-Queried Audio Source Separation , 2022, INTERSPEECH.
[22] Wenwu Wang,et al. Leveraging Pre-trained BERT for Audio Captioning , 2022, 2022 30th European Signal Processing Conference (EUSIPCO).
[23] S. Dubnov,et al. HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection , 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[24] S. Hoi,et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation , 2022, ICML.
[25] B. Ommer,et al. High-Resolution Image Synthesis with Latent Diffusion Models , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[26] João F. Henriques,et al. Audio Retrieval With Natural Language Queries: A Benchmark Study , 2021, IEEE Transactions on Multimedia.
[27] J. Bello,et al. Wav2CLIP: Learning Robust Audio Representations from CLIP , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[28] Federico Raue,et al. AudioCLIP: Extending CLIP to Image, Text and Audio , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[29] X. Serra,et al. FSD50K: An Open Dataset of Human-Labeled Sound Events , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[30] Roger Zimmermann,et al. Multimodal research in vision and language: A review of current and emerging trends , 2022, Inf. Fusion.
[31] Helin Wang,et al. Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information , 2021, DCASE.
[32] Mark D. Plumbley,et al. Audio Captioning Transformer , 2021, DCASE.
[33] Junnan Li,et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation , 2021, NeurIPS.
[34] Mark D. Plumbley,et al. Sound Event Detection: A tutorial , 2021, IEEE Signal Processing Magazine.
[35] Aren Jansen,et al. The Benefit of Temporally-Strong Labels in Audio Event Classification , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[36] James R. Glass,et al. AST: Audio Spectrogram Transformer , 2021, Interspeech.
[37] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[38] Radu Soricut,et al. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Quoc V. Le,et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.
[40] Xiang Li,et al. Automated Audio Captioning with Weakly Supervised Pre-Training and Word Selection Methods , 2021, DCASE.
[41] Félix Gontier,et al. Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags , 2021, DCASE.
[42] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021 .
[43] Kyosuke Nishida,et al. A Transformer-based Audio Captioning Model with Keyword Estimation , 2020, INTERSPEECH.
[44] Pieter Abbeel,et al. Denoising Diffusion Probabilistic Models , 2020, NeurIPS.
[45] Andrew Zisserman,et al. VGGSound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[46] Jianfeng Gao,et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.
[47] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.
[48] Lin Su,et al. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data , 2020, ArXiv.
[49] Mark D. Plumbley,et al. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[50] Haytham M. Fayek,et al. Temporal Reasoning via Audio Question Answering , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[51] Omer Levy,et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.
[52] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res.
[53] Tuomas Virtanen,et al. Clotho: an Audio Captioning Dataset , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[54] Jason J. Corso,et al. Unified Vision-Language Pre-Training for Image Captioning and VQA , 2019, AAAI.
[55] Gunhee Kim,et al. AudioCaps: Generating Captions for Audios in The Wild , 2019, NAACL.
[56] Mark D. Plumbley,et al. Weakly Labelled AudioSet Tagging With Attention Neural Networks , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[57] Kai Yu,et al. Audio Caption: Listen and Tell , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[58] Xinlei Chen,et al. nocaps: novel object captioning at scale , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[59] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[60] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.
[61] David J. Fleet,et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.
[62] Shane Legg,et al. Deep Reinforcement Learning from Human Preferences , 2017, NIPS.
[63] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[64] Aren Jansen,et al. Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[65] Siqi Liu,et al. Improved Image Captioning via Policy Gradient optimization of SPIDEr , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).
[66] Qiang Huang,et al. Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[67] Basura Fernando,et al. SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.
[68] Guigang Zhang,et al. Deep Learning , 2016, Int. J. Semantic Comput.
[69] Karol J. Piczak. ESC: Dataset for Environmental Sound Classification , 2015, ACM Multimedia.
[70] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.
[71] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[72] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[73] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[74] Mark D. Plumbley,et al. Acoustic Scene Classification: Classifying environments from the sounds they produce , 2014, IEEE Signal Processing Magazine.
[75] Justin Salamon,et al. A Dataset and Taxonomy for Urban Sound Research , 2014, ACM Multimedia.
[76] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[77] Xavier Serra,et al. Freesound technical demo , 2013, ACM Multimedia.
[78] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[79] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.
[80] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[81] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[82] Geoffrey E. Hinton,et al. Learning representations by back-propagating errors , 1986, Nature.