Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events

Automated audio captioning is a cross-modal task that generates natural language descriptions summarizing the sound events in an audio clip. However, grounding the actual sound events in a given audio clip based on its corresponding caption has not been investigated. This paper contributes an Audio-Grounding dataset, which provides the correspondence between sound events and the captions in AudioCaps, along with the location (timestamps) of each sound event present. Based on this dataset, we propose the text-to-audio grounding (TAG) task, which jointly considers the relationship between audio processing and language understanding. A baseline approach is provided, achieving an event-F1 score of 28.3% and a Polyphonic Sound Detection Score (PSDS) of 14.7%.
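To make the reported numbers concrete, below is a minimal sketch of how an event-based F1 score for grounded sound events can be computed with the sed_eval toolkit. The file name, event labels, timestamps, and the t_collar tolerance are illustrative assumptions, not the paper's exact evaluation setup.

    # Minimal sketch: event-based F1 for a single clip using sed_eval.
    # All annotations below are made up for illustration.
    import sed_eval
    import dcase_util

    # Reference annotations: each sound event with onset/offset in seconds.
    reference = dcase_util.containers.MetaDataContainer([
        {'file': 'clip.wav', 'event_label': 'dog barking',
         'event_onset': 0.5, 'event_offset': 2.1},
        {'file': 'clip.wav', 'event_label': 'man speaking',
         'event_onset': 1.0, 'event_offset': 4.8},
    ])

    # System output: events predicted for the same clip.
    estimated = dcase_util.containers.MetaDataContainer([
        {'file': 'clip.wav', 'event_label': 'dog barking',
         'event_onset': 0.6, 'event_offset': 2.0},
    ])

    metrics = sed_eval.sound_event.EventBasedMetrics(
        event_label_list=reference.unique_event_labels,
        t_collar=0.250,  # onset/offset tolerance in seconds (assumed value)
    )
    metrics.evaluate(reference_event_list=reference,
                     estimated_event_list=estimated)

    overall = metrics.results_overall_metrics()
    print('event-F1: {:.1%}'.format(overall['f_measure']['f_measure']))

In a full evaluation, evaluate() would be called once per clip over the whole test set before reading out the overall metrics; PSDS is computed separately with its own tooling and operating-point sweep.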
