Joint Analysis of Sound Events and Acoustic Scenes Using Multitask Learning

Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods, such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). The conventional methods address SED and ASC separately even though sound events and acoustic scenes are closely related to each other. For example, in the acoustic scene "office," the sound events "mouse clicking" and "keyboard typing" are likely to occur. Therefore, it is expected that information on sound events and acoustic scenes will be of mutual aid for SED and ASC. In this paper, we propose multitask learning for joint analysis of sound events and acoustic scenes, in which the parts of the networks holding information on sound events and acoustic scenes in common are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points in terms of the F-score, respectively, compared with the conventional CRNN-based method.

[1]  Vesa T. Peltonen,et al.  Audio-based context recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Tuomas Virtanen,et al.  Context-dependent sound event detection , 2013, EURASIP Journal on Audio, Speech, and Music Processing.

[3]  Ankit Shah,et al.  DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System , 2017, DCASE.

[4]  Bart Vanrumste,et al.  An exemplar-based NMF approach to audio event detection , 2013, 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[5]  Emmanuel Vincent,et al.  Sound Event Detection in the DCASE 2017 Challenge , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[6]  Yong Xu,et al.  Surrey-cvssp system for DCASE2017 challenge task4 , 2017, ArXiv.

[7]  Cheung-Fat Chan,et al.  An abnormal sound detection and classification system for surveillance applications , 2010, 2010 18th European Signal Processing Conference.

[8]  Julien Pinquier,et al.  Water sound recognition based on physical models , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9]  Anssi Klapuri,et al.  Latent semantic analysis in sound event detection , 2011, 2011 19th European Signal Processing Conference.

[10]  Emmanouil Benetos,et al.  Polyphonic Sound Event and Sound Activity Detection: A Multi-Task Approach , 2019, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[11]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Alain Rakotomamonjy,et al.  Histogram of gradients of Time-Frequency Representations for Audio scene detection , 2015, ArXiv.

[13]  Nikos Fakotakis,et al.  On acoustic surveillance of hazardous situations , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[14]  Ching-Yung Lin,et al.  Healthcare audio event classification using Hidden Markov Models and Hierarchical Hidden Markov Models , 2009, 2009 IEEE International Conference on Multimedia and Expo.

[15]  Heikki Huttunen,et al.  Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16]  Rich Caruana,et al.  Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[17]  Shinji Watanabe,et al.  Joint CTC-attention based end-to-end speech recognition using multi-task learning , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Keisuke Imoto,et al.  Introduction to acoustic event and scene analysis , 2018 .

[19]  Ryosuke Yamanishi,et al.  Joint Analysis of Acoustic Event and Scene Based on Multitask Learning , 2019, ArXiv.

[20]  Suehiro Shimauchi,et al.  Acoustic Scene Analysis Based on Hierarchical Generative Model of Acoustic Event Sequence , 2016, IEICE Trans. Inf. Syst..

[21]  Xiangdong Wang,et al.  Guided Learning Convolution System for DCASE 2019 Task 4 , 2019, DCASE.

[22]  Reishi Kondo,et al.  Acoustic Event Detection Method Using Semi-Supervised Non-Negative Matrix Factorization with Mixtures of Local Dictionaries , 2016, DCASE.

[23]  Nobutaka Ono,et al.  Acoustic Topic Model for Scene Analysis With Intermittently Missing Observations , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[24]  Seisuke Kyochi,et al.  Sound Event Detection Using Graph Laplacian Regularization Based on Event Co-occurrence , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Kai Oliver Arras,et al.  Audio-based human activity recognition using Non-Markovian Ensemble Voting , 2012, 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication.

[26]  Sungrack Yun,et al.  Discriminative training of GMM parameters for audio scene classification and audio tagging , 2016 .

[27]  Nicolas Turpault,et al.  Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments , 2018, DCASE.

[28]  Tomoki Toda,et al.  Duration-Controlled LSTM for Polyphonic Sound Event Detection , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[29]  Archontis Politis,et al.  Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks , 2018, IEEE Journal of Selected Topics in Signal Processing.

[30]  Tuomas Virtanen,et al.  Acoustic Scene Classification: An Overview of Dcase 2017 Challenge Entries , 2018, 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC).

[31]  Annamaria Mesaros,et al.  Metrics for Polyphonic Sound Event Detection , 2016 .

[32]  Mark Hasegawa-Johnson,et al.  Acoustic fall detection using Gaussian mixture models and GMM supervectors , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[33]  Jasha Droppo,et al.  Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Suehiro Shimauchi,et al.  User activity estimation method based on probabilistic generative model of acoustic event sequence with user activity and its subordinate categories , 2013, INTERSPEECH.

[35]  Bin Ma,et al.  Convolutional neural network with multi-task learning scheme for acoustic scene classification , 2017, 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[36]  Tuomas Virtanen,et al.  TUT database for acoustic scene classification and sound event detection , 2016, 2016 24th European Signal Processing Conference (EUSIPCO).

[37]  Janto Skowronek,et al.  Automatic surveillance of the acoustic activity in our living environment , 2005, 2005 IEEE International Conference on Multimedia and Expo.

[38]  Tuomas Virtanen,et al.  Acoustic event detection in real life recordings , 2010, 2010 18th European Signal Processing Conference.