Discrimination of speech and non-linguistic vocalizations by Non-Negative Matrix Factorization

We introduce features based on Non-Negative Matrix Factorization (NMF) for discriminating speech from non-linguistic vocalizations such as laughter or breathing, a crucial task in the recognition of spontaneous speech. NMF has been used successfully in speech-related tasks such as de-noising and speaker separation. While existing approaches use it as a preprocessing step for conventional speech recognizers, we aim to classify the output of the NMF algorithm directly. To this end, we propose a feature extraction procedure based on a supervised variant of NMF and consider two different algorithms. Applying our approach to a spontaneous speech corpus, we show that adding NMF features to an MFCC-based classifier increases the mean recall of speech and non-linguistic vocalizations by over 2.5% absolute, and the recall of laughter in particular by 6.6% absolute. The improvement is significant at a level of 0.4%.
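As an illustration of what a supervised NMF feature extractor can look like, the following minimal sketch holds a basis matrix W fixed, estimates only the activations H with the Euclidean multiplicative update of Lee and Seung, and then sums the activations per class. The function names, the per-class summation and normalization, and the assumption that each column of W has been labelled with a class (for example speech or laughter) are illustrative choices, not a description of the exact procedure or of either algorithm evaluated in the paper.

```python
import numpy as np


def supervised_nmf_activations(V, W, n_iter=500, eps=1e-12, seed=0):
    """Estimate non-negative activations H such that V is approximately W @ H.

    W is held fixed (the "supervised" part, an assumption of this sketch);
    only H is updated, using the Euclidean multiplicative update rule.
    V: (n_bins, n_frames) non-negative magnitude spectrogram.
    W: (n_bins, n_components) non-negative basis spectra.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        # The multiplicative update keeps H non-negative at every step.
        H *= WtV / (WtW @ H + eps)
    return H


def class_activation_features(H, labels, n_classes, eps=1e-12):
    """Sum activations over time and over each class's components.

    labels[k] is the (hypothetical) class index of basis column k.
    Returns a vector of length n_classes that sums to approximately 1.
    """
    labels = np.asarray(labels)
    energy = H.sum(axis=1)
    feats = np.array([energy[labels == c].sum() for c in range(n_classes)])
    return feats / (feats.sum() + eps)
```

A feature vector of this kind could then be concatenated with MFCC-based features before classification; how the features are actually combined in the paper is not specified in the abstract.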
