Metric Learning with Background Noise Class for Few-Shot Detection of Rare Sound Events

Few-shot learning systems for sound event recognition have gained interest because they require only a few examples to adapt to new target classes, without fine-tuning. However, such systems have so far been applied only to isolated chunks of sound for classification or verification. In this paper, we aim to achieve few-shot detection of rare sound events from query sequences that contain not only the target events but also other events and background noise. The detector must therefore avoid false positives on both the other events and the background noise. We propose metric learning with a background noise class for few-shot detection. Our contributions are the explicit inclusion of background noise as an independent class, a suitable loss function that emphasizes this additional class, and a corresponding sampling strategy that assists training. Together, these yield a feature space in which the event classes and the background noise class are sufficiently separated. Evaluations on few-shot detection tasks using DCASE 2017 Task 2 and ESC-50 show that the proposed method outperforms metric learning that does not consider the background noise class. Its few-shot detection performance is also comparable to that of the DCASE 2017 Task 2 baseline system, which requires a large amount of annotated audio data.
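
The idea of treating background noise as its own class can be illustrated with a minimal sketch. It is not the loss function or sampling strategy proposed in the paper, which the abstract does not specify. As assumptions, it uses prototype-style class centroids, a softmax over negative squared distances, and a per-class weight that scales the loss for the background class. All names, weights, and the toy 2-D embeddings are hypothetical.

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_posteriors(query, prototypes):
    """Softmax over negative squared distances to each class prototype."""
    logits = {c: -sq_dist(query, p) for c, p in prototypes.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def weighted_loss(query, label, prototypes, class_weights):
    """Class-weighted negative log-likelihood (assumed form, not the paper's).

    A weight above 1 on the background class makes its errors cost more.
    """
    p = class_posteriors(query, prototypes)[label]
    return -class_weights.get(label, 1.0) * math.log(p)

def detect(frames, prototypes, target, threshold=0.5):
    """Flag frames whose posterior for the target class exceeds the threshold."""
    return [class_posteriors(f, prototypes)[target] > threshold for f in frames]

# Hypothetical few-shot support set: two event classes plus background noise.
support = {
    "glass_break": [[1.0, 0.0], [0.9, 0.1]],
    "other_event": [[0.0, 1.0], [0.1, 0.9]],
    "background": [[0.0, 0.0], [0.1, -0.1]],
}
prototypes = {c: centroid(v) for c, v in support.items()}

# A query sequence with one target-like frame and one background-like frame.
flags = detect([[0.95, 0.05], [0.0, 0.0]], prototypes, "glass_break")
```

In this sketch the background-like frame is absorbed by the background prototype rather than by the nearest event class, which is the behavior the explicit noise class is meant to encourage.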
