Paper Information - WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Audio-language (AL) multimodal learning has advanced significantly in recent years. However, existing audio-language datasets are costly and time-consuming to collect and are therefore limited in size. To address this data scarcity, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. The raw descriptions harvested online are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT, a large language model, is used to filter and transform the raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. Systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. We hope that WavCaps will facilitate research in audio-language multimodal learning and demonstrate the potential of using ChatGPT to enhance academic research. Our dataset and code are available at https://github.com/XinhaoMei/WavCaps.
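
The abstract mentions a three-stage pipeline that filters noisy descriptions and rewrites them into captions, but it does not spell out the individual stages. The Python sketch below is only an illustration of a generic filter-then-transform flow of that kind, not the authors' implementation. Every name in it (`Clip`, `prefilter`, `transform`, `postfilter`, `toy_rewrite`, `run_pipeline`), the split into these three particular stages, and the word-count thresholds are assumptions made for the example. The language-model call is represented by an injected `rewrite` function, and `toy_rewrite` is an offline stand-in that merely strips URLs and file names.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Clip:
    """One audio clip with its web-harvested description (hypothetical schema)."""
    audio_id: str
    raw_description: str
    caption: Optional[str] = None


def prefilter(clips: List[Clip], min_words: int = 2) -> List[Clip]:
    """Assumed stage A: drop clips whose raw description is too short to rewrite."""
    return [c for c in clips if len(c.raw_description.split()) >= min_words]


def transform(clips: List[Clip],
              rewrite: Callable[[str], Optional[str]]) -> List[Clip]:
    """Assumed stage B: rewrite each raw description into a caption.

    `rewrite` stands in for a language-model call; it returns None
    when the description should be rejected.
    """
    kept = []
    for c in clips:
        caption = rewrite(c.raw_description)
        if caption is not None:
            kept.append(Clip(c.audio_id, c.raw_description, caption))
    return kept


def postfilter(clips: List[Clip], max_words: int = 20) -> List[Clip]:
    """Assumed stage C: drop empty or overly long captions."""
    return [c for c in clips
            if c.caption and len(c.caption.split()) <= max_words]


def toy_rewrite(text: str) -> Optional[str]:
    """Offline stand-in for the model: strip URLs and audio file names."""
    cleaned = re.sub(r"https?://\S+|\S+\.(?:wav|mp3|flac)\b", "", text,
                     flags=re.IGNORECASE)
    cleaned = " ".join(cleaned.split()).strip(" .")
    if not cleaned:
        return None  # nothing describable is left, so reject the clip
    return cleaned[0].upper() + cleaned[1:] + "."


def run_pipeline(clips: List[Clip],
                 rewrite: Callable[[str], Optional[str]]) -> List[Clip]:
    """Chain the three assumed stages."""
    return postfilter(transform(prefilter(clips), rewrite))


clips = [
    Clip("a1", "dog barking in the park rec_001.wav"),
    Clip("a2", "untitled"),
    Clip("a3", "http://example.com/sound.mp3 file0.wav"),
]
result = run_pipeline(clips, toy_rewrite)
for clip in result:
    print(clip.audio_id, "->", clip.caption)
```

Passing the rewriting step in as a plain function keeps the sketch runnable without network access; a real language-model client could be substituted for `toy_rewrite` without touching the surrounding stages.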
