Investigation of acoustic and visual features for acoustic scene classification

Abstract: Acoustic scene classification has attracted great interest in recent years because of its diverse applications. Various acoustic and visual features have been proposed and evaluated, but few studies have investigated the aggregation of acoustic and visual features for this task. In this paper, we investigate various feature sets based on the fusion of acoustic and visual features. Specifically, the acoustic features are extracted directly from the waveform: spectral centroid, spectral entropy, spectral flux, spectral roll-off, short-time energy, zero-crossing rate, and Mel-frequency cepstral coefficients. For the visual features, we calculate local binary patterns, histograms of gradients, and moments from the time-frequency representation of the audio scene. Three feature selection algorithms are then applied to the feature sets to reduce feature dimensionality: correlation-based feature selection, principal component analysis, and ReliefF. Experimental results show that the proposed system achieves an accuracy improvement of 15.43% over the baseline system on the development set. When the full development set is used for training, the accuracy on the evaluation set provided by the TUT Acoustic Scenes 2016 challenge is 87.44%, which is the fourth best among all non-neural-network systems.
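The pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. It covers only a subset of the listed features (short-time energy, zero-crossing rate, and spectral centroid on the acoustic side; a basic 3x3 local binary pattern histogram on the visual side) and only one of the three selection methods (principal component analysis). The frame length, hop size, Hann window, mean/standard-deviation pooling, and all function names are assumptions made for illustration; they are not taken from the paper and may differ from the authors' configuration.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def acoustic_features(x, sr, frame_len=1024, hop=512):
    """Per-frame short-time energy, zero-crossing rate, spectral centroid.

    Also returns the magnitude spectrogram (frames x frequency bins).
    """
    frames = frame_signal(x, frame_len, hop)
    energy = np.mean(frames ** 2, axis=1)
    # Fraction of adjacent sample pairs whose sign differs.
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    # Magnitude spectrum of each Hann-windowed frame (window is an assumption).
    mag = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    # Magnitude-weighted mean frequency; epsilon guards silent frames.
    centroid = (mag @ freqs) / (mag.sum(axis=1) + 1e-12)
    return energy, zcr, centroid, mag

def lbp_histogram(img):
    """Basic 8-neighbour LBP (radius 1) as a normalized 256-bin histogram."""
    h, w = img.shape
    center = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def clip_feature_vector(x, sr):
    """Fuse pooled acoustic statistics with an LBP histogram of the
    log-magnitude spectrogram (the pooling scheme is an assumption)."""
    energy, zcr, centroid, mag = acoustic_features(x, sr)
    acoustic = np.array([f(v) for v in (energy, zcr, centroid)
                         for f in (np.mean, np.std)])
    visual = lbp_histogram(np.log(mag + 1e-10))
    return np.concatenate([acoustic, visual])

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T
```

For a one-second clip `x` sampled at `sr` Hz, `clip_feature_vector(x, sr)` returns a 262-dimensional vector (6 pooled acoustic statistics plus 256 LBP bins). Stacking such vectors for many clips into a matrix and passing it to `pca_reduce` gives the reduced representation that a classifier would consume.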
