Sound-Event Classification Using Robust Texture Features for Robot Hearing

Sound-event classification often relies on time–frequency analysis, which produces an image-like spectrogram. Recent approaches, such as spectrogram image features and subband power distribution image features, extract local image statistics such as the mean and variance from the spectrogram, and they have demonstrated good performance. However, we argue that such simple image statistics cannot adequately capture the complex texture details of the spectrogram. We therefore propose to extract the local binary pattern (LBP) from the logarithm of the Gammatone-like spectrogram. The LBP feature, however, is sensitive to noise. After analyzing the spectrograms of sound events and of audio noise, we find that the magnitude of pixel differences, which the LBP feature discards, carries important information for sound-event classification. We thus propose a multichannel LBP feature based on pixel-difference quantization to improve robustness to audio noise. In view of the differences between spectrograms and natural images, and of the reliability issues of LBP features, we further propose two projection-based LBP features to better capture the texture information of the spectrogram. To validate the proposed multichannel projection-based LBP features for robot hearing, we have built a new sound-event classification database, the NTU-SEC database, in the context of social interaction between humans and robots. It is publicly available to promote research on sound-event classification in a social context. The proposed approaches are compared with the state of the art on the RWCP database and the NTU-SEC database, and they consistently demonstrate superior performance under various noise conditions.
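
To make the idea of a multichannel LBP concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain 3x3, 8-neighbour LBP computed on the logarithm of an arbitrary non-negative spectrogram (the Gammatone-like filterbank is not reproduced here). It also assumes that "pixel-difference quantization" can be illustrated by thresholding the neighbour-minus-centre difference at several levels, with one LBP histogram per level. The function names, the threshold values (0.0, 1.0, 2.0) and the `eps` offset are hypothetical choices made only for this illustration, and the projection-based features are not covered.

```python
import numpy as np

# Offsets of the 8 neighbours in a 3x3 window, clockwise from the top-left.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_codes(image, threshold=0.0):
    """Return 8-bit LBP codes for the interior pixels of a 2-D array.

    A bit is set when (neighbour - centre) >= threshold. With threshold=0
    this is the classic LBP, which keeps only the sign of the difference.
    """
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    center = image[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(OFFSETS):
        # Shifted view of the image aligned with the interior pixels.
        neighbor = image[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += ((neighbor - center) >= threshold).astype(np.int64) * (1 << bit)
    return codes


def multichannel_lbp_histogram(spectrogram, thresholds=(0.0, 1.0, 2.0), eps=1e-10):
    """Concatenate one normalized 256-bin LBP histogram per threshold.

    The thresholds are illustrative assumptions: each one quantizes the
    pixel difference at a different magnitude, so the concatenated feature
    retains some of the magnitude information that plain LBP discards.
    """
    log_spec = np.log(np.asarray(spectrogram, dtype=float) + eps)
    channels = []
    for t in thresholds:
        codes = lbp_codes(log_spec, threshold=t)
        hist = np.bincount(codes.ravel(), minlength=256).astype(float)
        channels.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(channels)
```

Calling `multichannel_lbp_histogram(spec)` on a 2-D magnitude or power spectrogram `spec` returns a vector of length `256 * len(thresholds)` that could be passed to any standard classifier. Larger thresholds ignore small pixel differences, which is one simple way to reduce sensitivity to low-level noise in the spectrogram.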
