Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
Xavier Serra | Konstantinos Drossos | Xavier Favory | Benno Weck