An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments

Few-shot learning (FSL) is the problem of training a deep neural network with only a small set of positive samples. Traditional deep learning (DL) algorithms usually perform very well when trained on large datasets, but in many applications it is not possible to collect that many samples. In the image domain, typical FSL applications are those related to face recognition. In the audio domain, applications such as music fraud or speaker recognition can clearly benefit from FSL methods. This paper deals with the application of FSL to the detection of specific, intentional acoustic events produced by different types of sound alarms, such as doorbells or fire alarms, using a limited number of samples. These sounds typically occur in domestic environments, where many events from a wide variety of sound classes take place. Detecting such alarms in a practical scenario can therefore be considered an open-set recognition (OSR) problem. To work around the lack of a dedicated public dataset for audio FSL, researchers usually modify other available datasets. This paper provides the audio recognition community with a carefully annotated dataset for FSL and OSR comprising 1360 clips from 34 classes, divided into pattern sounds and unwanted sounds. To facilitate and promote research in this area, results for two baseline systems (one trained from scratch and another based on transfer learning) are presented.
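To make the combination of FSL and OSR concrete, the following minimal sketch shows one common way to frame the decision: each known alarm class is summarized by the mean of a few example embeddings, and a query is rejected as "unknown" when it is too far from every class. This is an illustration only, not the paper's baseline systems. The two-dimensional embeddings, the class names, the function names, and the `threshold` value are all hypothetical.

```python
import math


def centroid(vectors):
    """Average a list of equal-length vectors into one prototype."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_open_set(query, support, threshold):
    """Return the nearest known class, or "unknown" if none is close enough.

    support maps each known class label to a few example embeddings
    (the few-shot samples). threshold is an assumed rejection distance.
    """
    prototypes = {label: centroid(vecs) for label, vecs in support.items()}
    label, dist = min(
        ((lbl, euclidean(query, proto)) for lbl, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= threshold else "unknown"


# Hypothetical embeddings: two shots per known (pattern) class.
support = {
    "doorbell": [[0.9, 0.1], [1.1, -0.1]],
    "fire_alarm": [[-1.0, 1.0], [-1.2, 0.8]],
}

# A query near the doorbell examples is accepted as that class.
result_known = classify_open_set([1.0, 0.05], support, threshold=0.5)

# A query far from every prototype (an unwanted sound) is rejected.
result_unknown = classify_open_set([5.0, 5.0], support, threshold=0.5)

print(result_known, result_unknown)
```

In practice the embeddings would come from a trained network and the rejection threshold would be tuned on held-out data. The sketch only shows why a closed-set classifier is not enough when unwanted sounds from unseen classes can occur.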
