Design and Optimization of a Speech Recognition Front-End for Distant-Talking Control of a Music Playback Device

This paper addresses the challenging scenario of distant-talking control of a music playback device: a common portable speaker with four small loudspeakers in close proximity to one microphone. The user controls the device by voice, and the speech-to-music ratio can be as low as -30 dB during music playback. We propose a speech enhancement front-end that relies on known robust methods for echo cancellation, double-talk detection, and noise suppression, as well as a novel adaptive quasi-binary mask that is well suited for speech recognition. The optimization of the system is then formulated as a large-scale nonlinear programming problem in which the recognition rate is maximized, and the optimal values of the system parameters are found through a genetic algorithm. We validate our methodology by testing on the TIMIT database for different music playback levels and noise types. Finally, we show that the proposed front-end allows natural interaction with the device for limited-vocabulary voice commands.
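To illustrate the idea of tuning front-end parameters with a genetic algorithm, the following is a minimal sketch and not the authors' implementation. The objective `toy_recognition_rate` is a hypothetical stand-in for a real recognition-rate measurement; the three normalized parameters, the population size, the mutation width, and the operators (truncation selection, uniform crossover, Gaussian mutation, elitism) are all assumptions chosen for brevity.

```python
import random


def toy_recognition_rate(params):
    """Hypothetical stand-in for the recognition rate (assumption).

    A real system would run the front-end and the recognizer on a
    speech corpus; here the score simply peaks at an arbitrary target.
    """
    target = (0.3, 0.7, 0.5)
    return 1.0 - sum((p - t) ** 2 for p, t in zip(params, target)) / len(target)


def genetic_search(fitness, n_params, pop_size=30, generations=40,
                   mutation_std=0.1, seed=0):
    """Maximize `fitness` over parameters normalized to [0, 1]."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0]]              # elitism: keep the best as-is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Gaussian mutation, clipped to the valid range.
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, mutation_std)))
                     for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


best = genetic_search(toy_recognition_rate, n_params=3)
print(best, toy_recognition_rate(best))
```

In practice each fitness evaluation would require a full recognition run, so the population size and number of generations would be constrained by the cost of that evaluation.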
