Polyphonic audio tagging with sequentially labelled data using CRNN with learnable gated linear units

Audio tagging aims to detect the types of sound events occurring in an audio recording. To tag the polyphonic audio recordings, we propose to use Connectionist Temporal Classification (CTC) loss function on the top of Convolutional Recurrent Neural Network (CRNN) with learnable Gated Linear Units (GLUCTC), based on a new type of audio label data: Sequentially Labelled Data (SLD). In GLU-CTC, CTC objective function maps the frame-level probability of labels to clip-level probability of labels. To compare the mapping ability of GLU-CTC for sound events, we train a CRNN with GLU based on Global Max Pooling (GLU-GMP) and a CRNN with GLU based on Global Average Pooling (GLU-GAP). And we also compare the proposed GLU-CTC system with the baseline system, which is a CRNN trained using CTC loss function without GLU. The experiments show that the GLU-CTC achieves an Area Under Curve (AUC) score of 0.882 in audio tagging, outperforming the GLU-GMP of 0.803, GLU-GAP of 0.766 and baseline system of 0.837. That means based on the same CRNN model with GLU, the performance of CTC mapping is better than the GMP and GAP mapping. Given both based on the CTC mapping, the CRNN with GLU outperforms the CRNN without GLU.

[1]  Mahdieh Soleymani Baghshah,et al.  Multi-label classification with feature-aware implicit encoding and generalized cross-entropy loss , 2016, 2016 24th Iranian Conference on Electrical Engineering (ICEE).

[2]  Takumi Kobayashi,et al.  Acoustic Scene Classification based on Sound Textures and Events , 2015, ACM Multimedia.

[3]  J. Hanley,et al.  The meaning and use of the area under a receiver operating characteristic (ROC) curve. , 1982, Radiology.

[4]  Thomas Lidy,et al.  CQT-based Convolutional Neural Networks for Audio Scene Classification , 2016, DCASE.

[5]  Yong Xu,et al.  Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Aren Jansen,et al.  Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Mark B. Sandler,et al.  Automatic Tagging Using Deep Convolutional Neural Networks , 2016, ISMIR.

[8]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[9]  Guodong Guo,et al.  Content-based audio classification and retrieval by support vector machines , 2003, IEEE Trans. Neural Networks.

[10]  Annamaria Mesaros,et al.  Metrics for Polyphonic Sound Event Detection , 2016 .

[11]  Birger Kollmeier,et al.  On the use of spectro-temporal features for the IEEE AASP challenge ‘detection and classification of acoustic scenes and events’ , 2013, 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[12]  S. Squartini,et al.  DCASE 2016 Acoustic Scene Classification Using Convolutional Neural Networks , 2016, DCASE.

[13]  Christoph H. Lampert,et al.  Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation , 2016, ECCV.

[14]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[15]  Qiang Huang,et al.  Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging , 2017, INTERSPEECH.

[16]  Bhiksha Raj,et al.  Audio Event Detection using Weakly Labeled Data , 2016, ACM Multimedia.

[17]  Svilen Dimitrov,et al.  Analyzing Sounds of Home Environment for Device Recognition , 2014, AmI.

[18]  Kyogu Lee,et al.  Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation , 2016, ArXiv.

[19]  Yann Dauphin,et al.  Language Modeling with Gated Convolutional Networks , 2016, ICML.

[20]  Ankit Shah,et al.  DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System , 2017, DCASE.

[21]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[22]  Daniele Battaglino,et al.  Acoustic scene classification using convolutional neural networks , 2016 .

[23]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[24]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[25]  Yong Xu,et al.  Sound Event Detection and Time–Frequency Segmentation from Weakly Labelled Data , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.