Representing nonspeech audio signals through speech classification models

The human auditory system is well matched to both human speech and environmental sounds. This raises the question of whether speech material can provide useful information for training systems that analyze nonspeech audio signals, for example in a recognition task. To find out how similar nonspeech signals are to speech, we measure the closeness between a target nonspeech signal and different basis speech categories with a speech classification model. These speech similarities are then used as a descriptor to represent the target signal. We further show that a better descriptor can be obtained by learning to organize the speech categories hierarchically in a tree structure. We evaluate the approach on audio event analysis, using speech words from the TIMIT dataset to learn descriptors for the audio events of the Freiburg-106 dataset. On the event recognition task, our results outperform those of the best previously reported system, even though only a simple linear classifier is used. Furthermore, integrating the learned descriptors as an additional feature source leads to further performance gains.
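The core idea of the descriptor can be sketched as follows: train a classifier on basis speech categories, then take its posterior probabilities on a nonspeech signal as that signal's representation, and feed those representations to a simple linear classifier for event recognition. This is a minimal illustrative sketch, not the paper's actual pipeline; the feature dimensions, class counts, and randomly generated stand-ins for MFCC-like features are assumptions made purely for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumed for illustration): feature vectors for
# speech words (the basis categories) and for nonspeech audio events.
n_speech, n_events, dim = 400, 120, 20
n_word_classes, n_event_classes = 10, 4
X_speech = rng.normal(size=(n_speech, dim))
y_words = rng.integers(0, n_word_classes, size=n_speech)
X_events = rng.normal(size=(n_events, dim))
y_events = rng.integers(0, n_event_classes, size=n_events)

# Step 1: train a speech classification model on the basis word categories.
speech_model = LogisticRegression(max_iter=1000).fit(X_speech, y_words)

# Step 2: the posterior over speech categories acts as the "speech
# similarity" descriptor for each nonspeech signal.
descriptors = speech_model.predict_proba(X_events)

# Step 3: a simple linear classifier on the descriptors recognizes events.
event_model = LinearSVC(max_iter=5000).fit(descriptors, y_events)
print(descriptors.shape)  # one row per event, one column per speech category
```

Each descriptor row is a probability distribution over the speech categories, so it sums to one; the hierarchical variant described above would additionally group similar speech categories before computing such similarities.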
