Paper information - Audio-visual speech recognition with background music using single-channel source separation

Audio-visual speech recognition with background music using single-channel source separation

This paper considers audio-visual speech recognition in the presence of background music. The proposed algorithm integrates audio-visual speech recognition (AVSR) with single-channel source separation (SCSS), and we apply it to recognize speech that is mixed with music signals. First, an SCSS algorithm based on nonnegative matrix factorization (NMF) and spectral masks separates the speech signal from the background music in the magnitude spectral domain. The separated speech is then passed to a regular AVSR system built on multi-stream hidden Markov models. By using the two approaches together, we aim to improve recognition accuracy both by cleaning the audio signal with SCSS and by supporting the recognition task with visual information. Experimental results show that combining AVSR with source separation gives remarkable improvements in the accuracy of the speech recognition system.
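The separation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes KL-divergence NMF with multiplicative updates, bases trained separately on speech and music magnitude spectrograms, and a Wiener-like soft mask with exponent `p`. The function names (`nmf_kl`, `separate`), the ranks, and the random stand-in spectrograms are all hypothetical and chosen only for illustration.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, W=None, seed=0, eps=1e-12):
    """Factor V ~ W @ H with KL-divergence multiplicative updates.

    If W is supplied it is kept fixed and only H is estimated
    (assumption: this is how pre-trained bases would be reused).
    """
    rng = np.random.default_rng(seed)
    update_W = W is None
    if update_W:
        W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((W.shape[1], V.shape[1])) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        if update_W:
            W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H

def separate(V_mix, W_speech, W_music, p=2, n_iter=200, eps=1e-12):
    """Split a mixture magnitude spectrogram using fixed bases and a soft mask."""
    W = np.hstack([W_speech, W_music])           # concatenated dictionary
    _, H = nmf_kl(V_mix, W.shape[1], n_iter=n_iter, W=W)
    k = W_speech.shape[1]
    S = W_speech @ H[:k]                         # initial speech estimate
    M = W_music @ H[k:]                          # initial music estimate
    mask = S**p / (S**p + M**p + eps)            # soft mask with values in [0, 1]
    return mask * V_mix, (1.0 - mask) * V_mix

# Random nonnegative matrices stand in for real magnitude spectrograms
# (16 frequency bins; the sizes are arbitrary).
rng = np.random.default_rng(1)
W_s, _ = nmf_kl(rng.random((16, 40)), rank=4)    # "speech" training data
W_m, _ = nmf_kl(rng.random((16, 40)), rank=4)    # "music" training data
V_mix = rng.random((16, 10))
speech_est, music_est = separate(V_mix, W_s, W_m)
```

Because the mask and its complement sum to one, the two estimates add back up to the mixture magnitude. In a full pipeline, the speech estimate would be combined with the mixture phase, inverted to a waveform, and passed to the recognizer's audio stream.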

Hakan Erdogan | Ibrahim Saygin Topkaya | Emad M. Grais

[1] Kevin P. Murphy, et al. Dynamic Bayesian Networks for Audio-Visual Speech Recognition, 2002, EURASIP J. Adv. Signal Process.

[2] H. Sebastian Seung, et al. Algorithms for Non-negative Matrix Factorization, 2000, NIPS.

[3] Mikkel N. Schmidt, et al. Single-channel speech separation using sparse non-negative matrix factorization, 2006, INTERSPEECH.

[4] Juergen Luettin, et al. Audio-Visual Speech Modeling for Continuous Speech Recognition, 2000, IEEE Trans. Multim.

[5] Luc Vandendorpe, et al. The M2VTS Multimodal Face Database (Release 1.00), 1997, AVBPA.

[6] C. Taylor, et al. Active shape models - 'Smart Snakes', 1992.

[7] L. Rabiner, et al. An introduction to hidden Markov models, 1986, IEEE ASSP Magazine.

[8] Juergen Luettin, et al. Audio-Visual Speech Modelling for Continuous Speech Recognition, 2000.

[9] Fred Nicolls, et al. Locating Facial Features with an Extended Active Shape Model, 2008, ECCV.

[10] Emad M. Grais, et al. Single channel speech music separation using nonnegative matrix factorization and spectral masks, 2011, 2011 17th International Conference on Digital Signal Processing (DSP).

[11] P. Mermelstein, et al. Distance measures for speech recognition, psychological and instrumental, 1976.

[12] Pierre Vandergheynst, et al. Blind Audiovisual Source Separation Based on Sparse Redundant Representations, 2010, IEEE Transactions on Multimedia.