Multi-label Few-shot Learning for Sound Event Recognition

Few-shot classification aims to generalize concepts learned from seen classes to unseen, novel classes using only a few examples. Although significant progress has been made in few-shot classification, most approaches focus on the standard multi-class scenario and learn a single-label embedding of the labeled examples in order to classify the unlabeled ones. Moreover, state-of-the-art few-shot learning methods mostly adopt a metric-based architecture and the so-called episode training strategy. While this approach works well for multi-class classification, it is hard to apply to the multi-label scenario because of the complexity of forming an episode. In this paper, we propose a One-vs.-Rest episode selection strategy to mitigate this issue and apply it to the multi-label few-shot problem. Experiments on the large-scale AudioSet data show that models trained with our strategy extract semantic features under the multi-label setting.
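To make the idea of a One-vs.-Rest episode concrete, the following is a minimal sketch of one plausible way to form such an episode from multi-label clips. It is an illustration under stated assumptions, not the procedure used in the paper: the function name `sample_one_vs_rest_episode`, the balanced positive/negative sampling, and the toy clip identifiers and labels are all hypothetical. The assumption is that each episode picks one target class and treats every clip as either containing that class (label 1) or not (label 0), regardless of the other labels the clip carries.

```python
import random


def sample_one_vs_rest_episode(clips, target, n_support, n_query, rng):
    """Form one binary (target vs. rest) episode from multi-label clips.

    clips: dict mapping a clip id to the set of labels attached to it.
    target: the class treated as "one"; all other classes are "rest".
    Returns (support, query), each a list of (clip_id, binary_label).
    Hypothetical helper: balanced sampling is an assumption of this sketch.
    """
    # Split the pool by whether the target label is present in the clip.
    positives = sorted(c for c, labels in clips.items() if target in labels)
    negatives = sorted(c for c, labels in clips.items() if target not in labels)

    need = n_support + n_query
    if len(positives) < need or len(negatives) < need:
        raise ValueError("not enough clips to form the episode")

    # Draw disjoint support and query clips for each side.
    pos_pick = rng.sample(positives, need)
    neg_pick = rng.sample(negatives, need)

    support = [(c, 1) for c in pos_pick[:n_support]]
    support += [(c, 0) for c in neg_pick[:n_support]]
    query = [(c, 1) for c in pos_pick[n_support:]]
    query += [(c, 0) for c in neg_pick[n_support:]]
    return support, query


# Toy multi-label pool (made-up ids and labels, for illustration only).
toy_clips = {
    "a": {"dog", "speech"},
    "b": {"dog"},
    "c": {"dog", "music"},
    "d": {"speech"},
    "e": {"music", "speech"},
    "f": {"music"},
}

support, query = sample_one_vs_rest_episode(
    toy_clips, target="dog", n_support=2, n_query=1, rng=random.Random(0)
)
```

Because each episode reduces to a binary decision about a single class, a clip with several labels can serve as a positive in one episode and as a negative in another, which sidesteps the difficulty of assembling a conventional N-way episode from overlapping label sets.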
