Neural Audio Captioning Based on Conditional Sequence-to-Sequence Model

We propose an audio captioning system that describes non-speech audio signals in the form of natural language. Unlike existing systems, it generates a full sentence describing a sound rather than an object label or an onomatopoeic word. This allows the description to convey richer information, such as how the sound is perceived and how its tone or volume changes over time, and also allows the system to handle unknown sounds. A major difficulty in realizing this capability is that the validity of a description depends not only on the sound itself but also on the situation or context. To address this problem, we propose a conditional sequence-to-sequence model in which a parameter called “specificity” is introduced as a condition that controls the amount of information contained in the output text, so that an appropriate description can be generated. Experiments show that the proposed model effectively generates descriptions whose specificity can be controlled.
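To make the idea concrete, the following is a minimal PyTorch sketch of one way such a conditional sequence-to-sequence captioner could be structured: a recurrent encoder over frame-level audio features, with the scalar specificity condition appended to the encoded summary before it initializes a word-level decoder. The class name, layer sizes, feature dimensions, and the exact point where the condition is injected are illustrative assumptions, not details taken from the paper.

# Sketch of a conditional sequence-to-sequence captioner. The architecture
# details below (dimensions, condition injection via the decoder's initial
# state) are assumptions for illustration only.
import torch
import torch.nn as nn


class ConditionalSeq2SeqCaptioner(nn.Module):
    def __init__(self, n_mels=64, hidden=256, vocab_size=1000, emb=128):
        super().__init__()
        # Encoder: bidirectional LSTM over frame-level audio features
        # (e.g. log-mel spectra).
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True,
                               bidirectional=True)
        # Project the final encoder state, concatenated with the scalar
        # specificity condition, into the decoder's initial hidden state.
        self.bridge = nn.Linear(2 * hidden + 1, hidden)
        # Decoder: word-level LSTM language model over caption tokens.
        self.embed = nn.Embedding(vocab_size, emb)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, audio_feats, specificity, word_ids):
        # audio_feats: (batch, frames, n_mels); specificity: (batch, 1);
        # word_ids: (batch, caption_len) teacher-forced input tokens.
        _, (h, _) = self.encoder(audio_feats)
        enc = torch.cat([h[0], h[1]], dim=-1)           # (batch, 2*hidden)
        cond = torch.cat([enc, specificity], dim=-1)    # append the condition
        h0 = torch.tanh(self.bridge(cond)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(word_ids), (h0, c0))
        return self.out(dec_out)                        # (batch, len, vocab)


if __name__ == "__main__":
    model = ConditionalSeq2SeqCaptioner()
    feats = torch.randn(2, 500, 64)           # two clips of log-mel frames
    spec = torch.tensor([[0.2], [0.9]])       # low vs. high requested specificity
    words = torch.randint(0, 1000, (2, 12))   # teacher-forcing token ids
    logits = model(feats, spec, words)
    print(logits.shape)                       # torch.Size([2, 12, 1000])

In this sketch, changing the specificity value at inference time would steer the decoder toward shorter, more generic or longer, more detailed word sequences, which is the role the paper assigns to the condition.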
