Semi-supervised Triplet Loss Based Learning of Ambient Audio Embeddings

Deep neural networks are particularly useful to learn relevant representations from data. Recent studies have demonstrated the potential of unsupervised representation learning for ambient sound analysis using various flavors of the triplet loss, and have compared this approach to supervised learning. However, in real situations, it is common to have a small labeled dataset and a large unlabeled one. In this paper, we combine unsupervised and supervised triplet loss based learning into a semi-supervised representation learning approach. We propose two flavors of this approach, whereby the positive samples for those triplets whose anchors are unlabeled are obtained either by applying a transformation to the anchor, or by selecting the nearest sample in the training set. We compare our approach to supervised and unsupervised representation learning, and study the influence of the ratio between the amounts of labeled and unlabeled data. We evaluate all the above approaches on an audio tagging task using the DCASE 2018 Task 4 dataset, and we show the impact of this ratio on the tagging performance.
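The two positive-sampling flavors described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the triplet margin loss is standard, while `pick_positive` is a hypothetical helper assuming that for an unlabeled anchor the positive is either a perturbed copy of the anchor (a stand-in for an audio transformation) or its nearest neighbor in the training set, and that for a labeled anchor it is another sample sharing the anchor's label.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors:
    max(0, ||a - p||^2 - ||a - n||^2 + margin)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def pick_positive(anchor_idx, embeddings, labels,
                  mode="nearest", rng=None, noise_std=0.01):
    """Select the positive sample for a triplet anchor.

    Labeled anchor (labels[anchor_idx] is not None): draw another
    sample with the same label.  Unlabeled anchor: flavor 1
    ("transform") perturbs the anchor itself; flavor 2 ("nearest")
    takes the closest other sample in the training set.
    """
    rng = rng or np.random.default_rng()
    anchor = embeddings[anchor_idx]
    if labels[anchor_idx] is not None:
        same = [i for i, l in enumerate(labels)
                if l == labels[anchor_idx] and i != anchor_idx]
        return embeddings[rng.choice(same)]
    if mode == "transform":
        # Flavor 1: transformed (here, noise-perturbed) copy of the anchor.
        return anchor + rng.normal(0.0, noise_std, size=anchor.shape)
    # Flavor 2: nearest sample in the training set, excluding the anchor.
    dists = np.sum((embeddings - anchor) ** 2, axis=1)
    dists[anchor_idx] = np.inf
    return embeddings[np.argmin(dists)]
```

In practice the loss would be applied to network embeddings and minimized by gradient descent; this sketch only shows how labeled and unlabeled anchors are handled differently when forming triplets.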
