Barge-in robust spoken dialogue interface using multichannel sound field control and array signal processing

A spoken dialogue system is in demand as a user-friendly human-machine interface that requires no special skills to operate. Speech has an advantageous feature: it is hands-free and eyes-free, i.e., one can speak while doing other tasks. To exploit this feature effectively, the system should remain usable even when the user stands away from the microphone or speaks while the system is still playing its output sound (the response sound). The difficulty in meeting these demands is that automatic speech recognition (ASR) degrades, both because the response sound feeds back into the microphone and because interfering noise from sources other than the user’s speech is observed. Since current ASR systems are sensitive to noise, a noise reduction method is indispensable. An acoustic echo canceller (AEC) is generally used to eliminate the response sound, and an adaptive beamformer (ABF) to eliminate the interfering noise. In each method, a filter is adapted to eliminate its target noise based on the minimum-mean-squared-error criterion. Consequently, when these filters are trained on signals containing sources other than their target noise, their performance degrades severely. To prevent such degradation, the system should detect the periods in which the observed signals contain sounds other than the target noise; this is called double-talk detection (DTD). However, accurate DTD is difficult, particularly in a situation where both response sound and interfering

∗Doctoral Dissertation, Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-DD0561031, September 30, 2007.
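The sensitivity of such adaptive filters to double-talk can be illustrated with a small simulation. The sketch below is not the method of the dissertation: it uses a normalized LMS (NLMS) filter as a generic stand-in for a filter adapted under the minimum-mean-squared-error criterion, and the echo path `echo_path`, the white-noise signals, the filter length, and the step size are all assumptions chosen only for illustration. It adapts the same filter twice, once when the microphone observes only the echo of the response sound and once when a second, uncorrelated source (standing in for the user's speech) is present during adaptation, and compares how closely each estimate matches the true echo path.

```python
import random


def nlms_echo_canceller(reference, observed, taps=8, step=0.5, eps=1e-8):
    """Adapt an FIR filter with NLMS so that it models the echo of
    `reference` contained in `observed`. Returns (weights, residual)."""
    w = [0.0] * taps          # current estimate of the echo path
    buf = [0.0] * taps        # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, observed):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # estimated echo
        e = d - y                                    # echo-cancelled output
        norm = sum(b * b for b in buf) + eps
        w = [wi + step * e * bi / norm for wi, bi in zip(w, buf)]
        residual.append(e)
    return w, residual


def misalignment(w, h):
    """Normalized squared error between estimated and true echo path."""
    num = sum((wi - hi) ** 2 for wi, hi in zip(w, h))
    den = sum(hi * hi for hi in h)
    return num / den


def apply_fir(h, x):
    """Causal FIR filtering of x by h (same length as x)."""
    return [sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)
            for t in range(len(x))]


rng = random.Random(0)
# Hypothetical short echo path; real room responses are far longer.
echo_path = [0.6, -0.3, 0.2, 0.1, -0.05, 0.02, 0.0, 0.01]
n = 4000
response = [rng.gauss(0.0, 1.0) for _ in range(n)]   # system output sound
user = [rng.gauss(0.0, 1.0) for _ in range(n)]       # stand-in for user speech
echo = apply_fir(echo_path, response)

# Single-talk: the microphone observes only the echo (the target noise).
w_single, _ = nlms_echo_canceller(response, echo)
# Double-talk: the user's signal is present while the filter adapts.
w_double, _ = nlms_echo_canceller(response, [e + u for e, u in zip(echo, user)])

mis_single = misalignment(w_single, echo_path)
mis_double = misalignment(w_double, echo_path)
print("misalignment, single-talk:", mis_single)
print("misalignment, double-talk:", mis_double)
```

In this toy setting the single-talk estimate converges to the true echo path, whereas the double-talk estimate keeps being perturbed by the second source. This is the behaviour that motivates freezing adaptation through double-talk detection.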
