Analysis of spectrogram image methods for sound event classification

The time-frequency spectrogram representation of an audio signal can be visually analysed by a trained researcher to recognise any underlying sound events in a process called “spectrogram reading”. However, this has not become a popular approach for automatic classification, as the field is driven by Automatic Speech Recognition (ASR) where frame-based features are popular. As opposed to speech, sound events typically have a more distinctive time-frequency representation, with the energy concentrated in a small number of spectral components. This makes them more suitable for classification based on their visual signature, and enables inspiration to be found in techniques from the related field of image processing. Recently, there have been a range of techniques that extract image processing-inspired features from the spectrogram for sound event classification. In this paper, we introduce the idea and structure behind six recent spectrogram image methods and analyse their performance on a large database containing 50 different environmental sounds to give a standardised comparison that is not often available in sound event classification tasks.

[1]  Douglas D. O'Shaughnessy,et al.  Invited paper: Automatic speech recognition: History, methods and challenges , 2008, Pattern Recognit..

[2]  Shrikanth Narayanan,et al.  Environmental Sound Recognition With Time–Frequency Audio Features , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  Satoshi Nakamura,et al.  Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition , 2000, LREC.

[4]  Chng Eng Siong,et al.  Image Feature Representation of the Subband Power Distribution for Robust Sound Event Classification , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Thomas S. Huang,et al.  Real-world acoustic event detection , 2010, Pattern Recognit. Lett..

[6]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[7]  Frank Kurth,et al.  Detecting bird sounds in a complex acoustic environment and application to bioacoustic monitoring , 2010, Pattern Recognit. Lett..

[8]  Tony Ezzat,et al.  Discriminative word-spotting using ordered spectro-temporal patch features , 2008, SAPA@INTERSPEECH.

[9]  Richard F. Lyon,et al.  Machine Hearing: An Emerging Field , 2010 .

[10]  Tetsuya Takiguchi,et al.  Gradient-based acoustic features for speech recognition , 2009, 2009 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS).

[11]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[12]  Chloé Clavel,et al.  Events Detection for an Audio-Based Surveillance System , 2005, 2005 IEEE International Conference on Multimedia and Expo.

[13]  Sridhar Krishnan,et al.  Time–Frequency Matrix Feature Extraction and Classification of Environmental Audio Signals , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Haizhou Li,et al.  Spectrogram Image Feature for Sound Event Classification in Mismatched Conditions , 2011, IEEE Signal Processing Letters.

[15]  Bhiksha Raj,et al.  Spectrographic seam patterns for discriminative word spotting , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Kamil Behun Image features in music style recognition , 2012 .

[17]  Jean-Jacques E. Slotine,et al.  Audio classification from time-frequency texture , 2008, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  Andrey Temko,et al.  ACOUSTIC EVENT DETECTION AND CLASSIFICATION IN SMART-ROOM ENVIRONMENTS: EVALUATION OF CHIL PROJECT SYSTEMS , 2006 .

[19]  Augusto Sarti,et al.  Scream and gunshot detection in noisy environments , 2007, 2007 15th European Signal Processing Conference.