Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events

In this paper, we propose a new strategy for acoustic scene classification (ASC): recognizing acoustic scenes by identifying distinct sound events. This differs from existing strategies, which characterize either the global acoustical distribution of the audio or the temporal evolution of short-term audio features, without analysis down to the level of sound events. To identify distinct sound events for each scene, we formulate ASC in a multi-instance learning (MIL) framework, where each audio recording is mapped into a bag-of-instances representation. Here, instances can be seen as high-level representations of the sound events inside a scene. We also propose an MIL neural network model, which implicitly identifies distinct instances (i.e., sound events). Furthermore, we propose two specially designed modules that model the multi-temporal-scale and multi-modal nature of sound events, respectively. Experiments were conducted on the official development set of DCASE2018 Task1 Subtask B, and our best-performing model improves over the official baseline by 9.4 percentage points (68.3% vs. 58.9%) in classification accuracy. This study indicates that recognizing acoustic scenes by identifying distinct sound events is effective, and it paves the way for future studies that combine this strategy with previous ones.
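The bag-of-instances idea can be illustrated with a minimal sketch. It assumes max pooling as the MIL aggregation step, which the abstract does not specify, and the scores, the function names `mil_max_pool` and `softmax`, and the three-class setup are all hypothetical. Each instance carries one score per scene class; the bag-level score for a class is taken from its highest-scoring instance, so that instance plays the role of the "distinct sound event" for that class.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mil_max_pool(instance_scores):
    """Aggregate per-instance class scores into bag-level scores.

    instance_scores: list of instances, each a list of per-class scores.
    Returns (bag_scores, key_instances), where key_instances[c] is the
    index of the instance that produced the bag score for class c.
    Max pooling is an assumption here, not the paper's stated choice.
    """
    num_classes = len(instance_scores[0])
    bag_scores, key_instances = [], []
    for c in range(num_classes):
        column = [inst[c] for inst in instance_scores]
        best = max(range(len(column)), key=column.__getitem__)
        bag_scores.append(column[best])
        key_instances.append(best)
    return bag_scores, key_instances


# Hypothetical bag: 4 instances (rows) x 3 scene classes (columns).
bag = [
    [0.1, 2.0, -1.0],
    [0.3, 0.5, 0.2],
    [1.5, -0.2, 0.0],
    [0.0, 0.1, 0.4],
]

bag_scores, key_instances = mil_max_pool(bag)
probs = softmax(bag_scores)
predicted = max(range(len(probs)), key=probs.__getitem__)

print("bag-level scores:", bag_scores)
print("key instance per class:", key_instances)
print("predicted class index:", predicted)
```

Only the recording-level scene label is needed to train such a model; which instance drives each class score is found implicitly by the pooling step, not given as a sound-event annotation.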
