Paper information - Design and Optimization of a Speech Recognition Front-End for Distant-Talking Control of a Music Playback Device

This paper addresses the challenging scenario of distant-talking control of a music playback device: a common portable speaker with four small loudspeakers in close proximity to a single microphone. The user controls the device by voice, and the speech-to-music ratio can be as low as -30 dB during music playback. We propose a speech enhancement front-end that relies on known robust methods for echo cancellation, double-talk detection, and noise suppression, as well as a novel adaptive quasi-binary mask that is well suited for speech recognition. The optimization of the system is then formulated as a large-scale nonlinear programming problem in which the recognition rate is maximized, and the optimal values of the system parameters are found through a genetic algorithm. We validate our methodology by testing on the TIMIT database for different music playback levels and noise types. Finally, we show that the proposed front-end allows natural interaction with the device for limited-vocabulary voice commands.
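To make the tuning step concrete, the following is a minimal sketch of a genetic algorithm that maximizes a score over box-constrained parameters. It is not the authors' implementation: the function names, the population settings, the selection and mutation scheme, and above all `toy_recognition_rate` (a hypothetical stand-in for running a recognizer on enhanced speech) are assumptions made for illustration only.

```python
import random


def genetic_maximize(fitness, bounds, pop_size=30, generations=60,
                     mutation_rate=0.2, seed=0):
    """Maximize fitness(params) over real parameters limited by (low, high) bounds."""
    rng = random.Random(seed)
    # Random initial population inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        elite = ranked[:2]                 # elitism: the two best survive unchanged
        parents = ranked[:pop_size // 2]   # truncation selection: top half breeds
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each parameter is copied from one of the parents.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Gaussian mutation, clipped back into the allowed range.
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    child[i] += rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i]))
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)


def toy_recognition_rate(params):
    """Hypothetical objective (assumption): equals 1.0 at (0.3, 0.7), lower elsewhere."""
    target = (0.3, 0.7)
    return 1.0 - sum((p - t) ** 2 for p, t in zip(params, target))


best = genetic_maximize(toy_recognition_rate, bounds=[(0.0, 1.0), (0.0, 1.0)])
print(best, toy_recognition_rate(best))
```

In a real tuning run, the objective would instead process a test set with the front-end under the candidate parameters and return the recognizer's accuracy, which makes each fitness evaluation expensive. That cost is one reason a derivative-free search such as a genetic algorithm is a natural fit for this kind of problem.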
