Acoustic Scene Clustering Using Joint Optimization of Deep Embedding Learning and Clustering Iteration

Acoustic scene classification has received considerable recent attention in the audio signal processing community. In contrast, few studies have addressed acoustic scene clustering, a newly emerging problem. Acoustic scene clustering aims to merge audio recordings of the same class of acoustic scene into a single cluster without using prior information or training classifiers. In this study, we propose a method for acoustic scene clustering that jointly optimizes feature learning and clustering iteration. The learned feature is a deep embedding extracted from a deep convolutional neural network (CNN), and the clustering algorithm is agglomerative hierarchical clustering (AHC). We formulate a unified loss function that integrates and optimizes these two procedures. We compare the proposed method against various features and clustering methods. The experimental results show that the proposed method outperforms other unsupervised methods in terms of normalized mutual information and clustering accuracy. In addition, the deep embedding outperforms many state-of-the-art features.
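As a rough illustration of the clustering and evaluation steps named above, the following sketch runs agglomerative hierarchical clustering on a few toy vectors and scores the result with normalized mutual information and clustering accuracy. It is not the proposed method: the CNN embedding learning and the unified loss are omitted, the 2-D points merely stand in for deep embeddings, and the average-linkage criterion, the geometric-mean NMI normalization, and all function names are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import permutations


def average_linkage_ahc(points, n_clusters):
    """Repeatedly merge the two clusters with the smallest average
    Euclidean distance until n_clusters remain (assumed linkage)."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        merged = clusters.pop(j)      # j > i, so index i stays valid
        clusters[i].extend(merged)

    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for m in members:
            labels[m] = cluster_id
    return labels


def normalized_mutual_information(truth, pred):
    """Mutual information divided by the geometric mean of the two
    label entropies (one common normalization, assumed here)."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    h_t = -sum(c / n * math.log(c / n) for c in ct.values())
    h_p = -sum(c / n * math.log(c / n) for c in cp.values())
    return mi / math.sqrt(h_t * h_p) if h_t > 0 and h_p > 0 else 0.0


def clustering_accuracy(truth, pred):
    """Fraction of items matched under the best one-to-one mapping of
    clusters to classes (brute force, so only for a few clusters)."""
    classes, clusters = sorted(set(truth)), sorted(set(pred))
    joint = Counter(zip(truth, pred))
    # Pad with None so every cluster can be paired with some entry.
    classes += [None] * max(0, len(clusters) - len(classes))
    best = max(sum(joint[(t, p)] for t, p in zip(perm, clusters))
               for perm in permutations(classes, len(clusters)))
    return best / len(truth)


# Toy stand-ins for deep embeddings of six recordings from two scenes.
embeddings = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
              (5.0, 5.1), (5.1, 5.0), (4.9, 5.2)]
scene_labels = ["park", "park", "park", "metro", "metro", "metro"]

cluster_labels = average_linkage_ahc(embeddings, n_clusters=2)
nmi = normalized_mutual_information(scene_labels, cluster_labels)
acc = clustering_accuracy(scene_labels, cluster_labels)
print(f"NMI = {nmi:.3f}, accuracy = {acc:.3f}")
```

The scene labels are used only for scoring, never for clustering, which matches the unsupervised setting described in the abstract. For realistic numbers of recordings and clusters, a library implementation of hierarchical clustering and an assignment solver would be more practical than the quadratic merge loop and brute-force permutation search used here.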
