Non-negative matrix factorization based compensation of music for automatic speech recognition

This paper proposes to use non-negative matrix factorization based speech enhancement in robust automatic recognition of mixtures of speech and music. We represent magnitude spectra of noisy speech signals as the non-negative weighted linear combination of speech and noise spectral basis vectors, that are obtained from training corpora of speech and music. We use overcomplete dictionaries consisting of random exemplars of the training data. The method is tested on theWall Street Journal large vocabulary speech corpus which is artificially corrupted with polyphonic music from the RWC music database. Various music styles and speech-tomusic ratios are evaluated. The proposed methods are shown to produce a consistent, significant improvement on the recognition performance in the comparison with the baseline method. Audio demonstrations of the enhanced signals are available at http://www.cs.tut.fi/ tuomasv.

[1]  Roger K. Moore,et al.  Hidden Markov model decomposition of speech and noise , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[2]  Richard M. Stern,et al.  On tracking noise with linear dynamical system models , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3]  Alex Acero,et al.  Noise robust speech recognition with a switching linear dynamic model , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Yifan Gong,et al.  A minimum-mean-square-error noise reduction algorithm on Mel-frequency cepstra for robust speech recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Tuomas Virtanen,et al.  Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Andrzej Cichocki,et al.  Non-Negative Matrix Factorization , 2020 .

[7]  Bhiksha Raj,et al.  Probabilistic Latent Variable Models as Nonnegative Factorizations , 2008, Comput. Intell. Neurosci..

[8]  Masataka Goto,et al.  Development of the RWC Music Database , 2004 .

[9]  Bhiksha Raj,et al.  A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds , 2009, NIPS.

[10]  Richard M. Stern,et al.  A vector Taylor series approach for environment-independent speech recognition , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[11]  David Malah,et al.  Speech enhancement using a minimum mean-square error log-spectral amplitude estimator , 1984, IEEE Trans. Acoust. Speech Signal Process..

[12]  John R. Hershey,et al.  Super-human multi-talker speech recognition: A graphical modeling approach , 2010, Comput. Speech Lang..

[13]  S. Boll,et al.  Suppression of acoustic noise in speech using spectral subtraction , 1979 .

[14]  Tuomas Virtanen,et al.  Noise robust exemplar-based connected digit recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.