WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Audio-language (AL) multimodal learning has advanced significantly in recent years. However, existing audio-language datasets are limited in size because their collection is costly and time-consuming. To address this data scarcity, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced the audio clips and their raw descriptions from web sources and a sound event detection dataset. Because the harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning, we propose a three-stage processing pipeline that filters noisy data and generates high-quality captions, in which ChatGPT, a large language model, is leveraged to filter and transform the raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of the WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. Systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. We hope that WavCaps will facilitate research in audio-language multimodal learning and demonstrate the potential of using ChatGPT to support academic research. Our dataset and code are available at https://github.com/XinhaoMei/WavCaps.
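
To make the ChatGPT-based caption-generation step concrete, the sketch below shows how a single noisy, web-harvested description might be filtered and rewritten into a caption via the OpenAI Python client. The prompt wording, model choice, and helper function here are illustrative assumptions for exposition, not the exact prompt or pipeline used to build WavCaps.

```python
# Minimal sketch of transforming one raw description into a clean caption
# with ChatGPT. Prompt text and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "Rewrite the following raw audio description as one concise caption "
    "that describes only the sound events. Remove URLs, file names, "
    "recording-equipment details, and any text unrelated to the audio "
    "content:\n\n{description}"
)

def generate_caption(raw_description: str) -> str:
    """Turn a noisy, web-harvested description into a single caption."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(description=raw_description),
        }],
        temperature=0.0,  # deterministic output for reproducibility
    )
    return response.choices[0].message.content.strip()

# Example usage (hypothetical input and output):
# generate_caption("dog_bark_03.wav - recorded with a Zoom H4n, "
#                  "a small dog barking twice in my backyard")
# -> "A small dog barks twice outdoors."
```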
