Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various kinds of information in the input signal and express them in natural language. Existing work focuses mainly on developing new methods and on improving performance as measured on existing datasets. Since AAC has attracted attention only recently, few works study how well existing pre-trained audio and natural language processing resources perform on the task. In this paper, we evaluate the performance of off-the-shelf models within a Transformer-based captioning approach. We use the freely available Clotho dataset to compare four pre-trained machine listening models, four word embedding models, and their combinations across many different settings. Our evaluation suggests that YAMNet combined with BERT embeddings produces the best captions. Moreover, fine-tuning pre-trained word embeddings generally leads to better performance. Finally, we show that sequences of audio embeddings can be processed with a Transformer encoder to produce higher-quality captions.
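To make the described pipeline concrete, the sketch below shows how a sequence of pre-extracted audio embeddings (for example, the 1024-dimensional frame embeddings produced by YAMNet) can be passed through a Transformer encoder and decoded into caption tokens. This is a minimal illustration under assumed settings: the model width, layer counts, vocabulary size, and the AudioCaptioner class itself are hypothetical and do not reproduce the paper's exact configuration; the word embedding table could equally be initialized from pre-trained BERT, Word2Vec, GloVe, or fastText vectors, which is the comparison carried out in the paper.

import torch
import torch.nn as nn

# Minimal encoder-decoder sketch of the captioning setup described above:
# a Transformer encoder over pre-extracted audio embeddings and a Transformer
# decoder that emits caption tokens. Dimensions (1024-d audio frames, 768-d
# word embeddings, vocabulary size) are illustrative assumptions, and
# positional encodings are omitted for brevity.
class AudioCaptioner(nn.Module):
    def __init__(self, audio_dim=1024, d_model=768, vocab_size=5000, nhead=8, num_layers=2):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)    # project audio frames to the model width
        self.word_emb = nn.Embedding(vocab_size, d_model)  # could be initialized from pre-trained word vectors
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, audio_frames, caption_tokens):
        # Encode the audio embedding sequence, then decode the caption tokens
        # with a causal (autoregressive) mask.
        memory = self.encoder(self.audio_proj(audio_frames))
        tgt = self.word_emb(caption_tokens)
        t = tgt.size(1)
        causal_mask = torch.triu(torch.full((t, t), float('-inf')), diagonal=1)
        decoded = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(decoded)  # per-step vocabulary logits

# Toy usage: a 10-frame audio embedding sequence and a 12-token caption prefix.
model = AudioCaptioner()
audio = torch.randn(1, 10, 1024)          # stands in for pre-extracted YAMNet embeddings
tokens = torch.randint(0, 5000, (1, 12))  # caption token ids
logits = model(audio, tokens)
print(logits.shape)                       # torch.Size([1, 12, 5000])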
