Deep Scalogram Representations for Acoustic Scene Classification

Spectrogram representations of acoustic scenes have achieved competitive performance for acoustic scene classification. Yet the spectrogram alone leaves a substantial amount of time-frequency information unused. In this study, we present an approach for exploring the benefits of deep scalogram representations, extracted in segments from an audio stream. The approach first transforms the segmented acoustic scenes into bump and Morse scalograms, as well as spectrograms; second, the spectrograms or scalograms are passed through pre-trained convolutional neural networks; third, the features extracted from a subsequent fully connected layer are fed into (bidirectional) gated recurrent neural networks, which are followed by a single highway layer and a softmax layer; finally, predictions from these three systems are fused by a margin sampling value strategy. We then evaluate the proposed approach on the acoustic scene classification data set of the 2017 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). On the evaluation set, an accuracy of 64.0% is obtained from bidirectional gated recurrent neural networks when fusing the spectrogram and the bump scalogram, an improvement on the 61.0% baseline result provided by the DCASE 2017 organisers. This result shows that the extracted bump scalograms can improve classification accuracy when fused with a spectrogram-based system.
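
The abstract names a margin sampling value strategy for late fusion but does not spell out the rule. The following is a minimal sketch under one common reading, which is an assumption here: for each audio sample, the margin of a system is the gap between its two largest softmax posteriors, and the prediction of the system with the largest margin is kept. The function names and the example posteriors are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def margin(probs: Sequence[float]) -> float:
    """Margin sampling value: gap between the two largest posteriors."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def fuse_by_margin(per_system: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (predicted_class, chosen_system) for one audio sample.

    per_system[k] holds the softmax posteriors of system k.
    Assumed rule: trust the system whose top two posteriors are furthest apart.
    """
    chosen = max(range(len(per_system)), key=lambda k: margin(per_system[k]))
    probs = per_system[chosen]
    predicted = max(range(len(probs)), key=lambda c: probs[c])
    return predicted, chosen


# Hypothetical posteriors over three classes from the three systems.
spectrogram = [0.40, 0.35, 0.25]  # small margin: the system is unsure
bump = [0.10, 0.80, 0.10]         # large margin: the system is confident
morse = [0.50, 0.30, 0.20]        # moderate margin

predicted, chosen = fuse_by_margin([spectrogram, bump, morse])
print(predicted, chosen)  # prints "1 1": class 1, taken from the bump system
```

Under this reading the fusion is decided per sample, so a scalogram-based system can override the spectrogram-based one only on the segments where it is the more confident of the two.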
