ASe: Acoustic Scene Embedding Using Deep Archetypal Analysis and GMM

In this paper, we propose a deep learning framework that combines the generalizability of Gaussian mixture models (GMMs) with the discriminative power of deep matrix factorization to learn an acoustic scene embedding (ASe) for the acoustic scene classification task. The proposed approach first builds a Gaussian mixture model-universal background model (GMM-UBM) from frame-wise spectral representations. This UBM is adapted to each waveform, and the likelihood of each spectral frame representation is stored in a feature matrix. This matrix is fed to a deep matrix factorization pipeline (with max-pooling at the audio recording level) to compute a sparse-convex discriminative representation. The proposed deep factorization model is based on archetypal analysis, a form of convex non-negative matrix factorization (NMF), which has been shown to be well suited to audio analysis. Finally, the obtained representation is mapped to a class label using a dictionary-based auto-encoder consisting of a linear, symmetric encoder and decoder with an efficient learning algorithm. The encoder projects the ASe of a waveform to the label space, while the decoder ensures that the feature can be reconstructed, resulting in better generalization on the test data.
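To make the data flow concrete, the following is a minimal NumPy sketch of three of the stages described above: frame-wise GMM likelihood features, recording-level max-pooling, and a linear auto-encoder whose decoder is the transpose of its encoder. It is an illustration under stated assumptions, not the implementation used in this work. UBM training, UBM adaptation, and the deep archetypal factorization are omitted, so the pooled likelihoods stand in for the ASe. The tied-weight objective min_W ||X - W^T S||^2 + lam ||W X - S||^2 and its Sylvester-equation solution are one common formulation of a symmetric linear auto-encoder and are assumed here. All function names, the toy data, and the parameter `lam` are hypothetical.

```python
import numpy as np


def component_loglik(frames, weights, means, variances):
    """Weighted log-likelihood of each frame under each diagonal Gaussian.

    frames: (T, D); weights: (K,); means, variances: (K, D). Returns (K, T).
    """
    diff = frames[None, :, :] - means[:, None, :]               # (K, T, D)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (K,)
    mahal = np.sum(diff ** 2 / variances[:, None, :], axis=2)   # (K, T)
    return np.log(weights)[:, None] - 0.5 * (log_det[:, None] + mahal)


def max_pool(feature_matrix):
    """Recording-level max-pooling over frames: (K, T) -> (K,)."""
    return feature_matrix.max(axis=1)


def fit_tied_autoencoder(X, S, lam=1.0):
    """Encoder W (k x d) minimising ||X - W^T S||_F^2 + lam * ||W X - S||_F^2.

    X: (d, N) embeddings, S: (k, N) one-hot labels. Setting the gradient to
    zero gives the Sylvester equation A W + W B = C with A = S S^T,
    B = lam * X X^T and C = (1 + lam) * S X^T (assumed formulation).
    """
    k, d = S.shape[0], X.shape[0]
    A = S @ S.T
    B = lam * (X @ X.T)
    C = (1.0 + lam) * (S @ X.T)
    # Column-major vec: vec(A W + W B) = (I_d kron A + B^T kron I_k) vec(W).
    # Adequate for small k * d; a dedicated Sylvester solver scales better.
    M = np.kron(np.eye(d), A) + np.kron(B.T, np.eye(k))
    w = np.linalg.solve(M, C.flatten(order="F"))
    return w.reshape((k, d), order="F")


# Toy data: 3 scene classes, 10 recordings each, 50 frames per recording.
rng = np.random.default_rng(0)
K, D, n_classes = 4, 5, 3
weights = np.full(K, 1.0 / K)          # stand-in UBM parameters
means = rng.normal(size=(K, D))
variances = np.ones((K, D))

labels = np.repeat(np.arange(n_classes), 10)
embeddings = []
for y in labels:
    frames = rng.normal(loc=means[y], scale=1.0, size=(50, D))
    embeddings.append(max_pool(component_loglik(frames, weights, means, variances)))

X = np.stack(embeddings, axis=1)        # (K, N) pooled features
S = np.eye(n_classes)[:, labels]        # (n_classes, N) one-hot labels
W = fit_tied_autoencoder(X, S, lam=1.0)

predicted = np.argmax(W @ X, axis=0)    # encoder: embedding -> label space
reconstructed = W.T @ S                 # decoder: label space -> embedding
```

In this sketch `lam` trades off label prediction against reconstruction. The decoder term acts as a regularizer on the encoder, which is the role the abstract attributes to the decoder.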
