Detection and Separation of Speech Event Using Audio and Video Information Fusion and Its Application to Robust Speech Interface

A method for detecting speech events under multiple-sound-source conditions using audio and video information is proposed. To detect speech events, sound localization with a microphone array and human tracking with stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, the time and location of each speech event can be obtained. The detected speech events are then exploited in a robust speech interface: a maximum-likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise, and the beamformer coefficients are updated continuously based on the speech-event information. The speech-event information is also used by the speech recognizer to extract the speech segments.
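As a rough illustration of the fusion step, the minimal sketch below (Python) models a speech event as a binary node in a Bayesian network whose children are an audio observation (a sound-localization score from the microphone array) and a video observation (a human-tracking score from stereo vision), and computes the posterior probability of a speech event at a candidate location by Bayes' rule under a naive-Bayes factorization. The function name and the likelihood values are hypothetical placeholders for illustration; they are not the network structure or the distributions estimated in the paper.

# Sketch: posterior of a binary speech-event node S given an audio
# observation A and a video observation V, assuming the naive-Bayes
# factorization P(A, V | S) = P(A | S) * P(V | S).
def speech_event_posterior(p_a_given_s, p_a_given_not_s,
                           p_v_given_s, p_v_given_not_s,
                           prior_s=0.5):
    # Unnormalized joint probabilities for S = 1 and S = 0.
    joint_s = prior_s * p_a_given_s * p_v_given_s
    joint_not_s = (1.0 - prior_s) * p_a_given_not_s * p_v_given_not_s
    # Normalize to obtain P(S = 1 | A, V).
    return joint_s / (joint_s + joint_not_s)

# Illustrative values: strong audio evidence and a tracked person at
# the same location yield a high speech-event probability (~0.92).
post = speech_event_posterior(p_a_given_s=0.8, p_a_given_not_s=0.2,
                              p_v_given_s=0.9, p_v_given_not_s=0.3)
print("P(speech event | audio, video) = %.3f" % post)

In the system described by the abstract, such an inference would be run over time and candidate locations, and the resulting speech-event information would drive both the beamformer coefficient updates and the speech-segment extraction for the recognizer.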
