In traditional voice activity detection (VAD) approaches, features of the audio stream, such as frame-energy features, are used for the voice/non-voice decision. In this paper, we present a general framework for visual-information-based VAD in a multi-modal system. First, Gaussian mixture visual models of voice and non-voice are designed, and the decision rule is discussed in detail. Then, the visual feature extraction method for VAD is investigated, and the best visual feature structure and the best number of mixture components are selected experimentally. Our experiments show that visual-information-based VAD achieves a substantial reduction in frame error rate (31.1% relative) and segments the audio-visual stream into sentences for recognition far more precisely (a 98.4% relative reduction in sentence-break error rate) than the frame-energy-based approach in the clean-audio case. Furthermore, the performance of visual-based VAD is independent of background noise.
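To make the decision rule concrete, the following is a minimal sketch of a GMM likelihood-ratio classifier for per-frame voice/non-voice decisions, in the spirit of the approach described above. The feature dimensionality, mixture count, threshold, and placeholder data are illustrative assumptions, not the paper's selected values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder training data: rows are per-frame visual feature vectors
# (e.g., mouth-region features) labeled as voice or non-voice.
voice_feats = np.random.randn(500, 12)          # hypothetical voice frames
nonvoice_feats = np.random.randn(500, 12) + 2.0 # hypothetical non-voice frames

# Fit one Gaussian mixture model per class; the mixture count (here 8)
# is the kind of parameter the paper selects experimentally.
gmm_voice = GaussianMixture(n_components=8, covariance_type="diag").fit(voice_feats)
gmm_nonvoice = GaussianMixture(n_components=8, covariance_type="diag").fit(nonvoice_feats)

def is_voice(frames, threshold=0.0):
    """Likelihood-ratio decision: a frame x is classified as voice when
    log p(x | voice) - log p(x | non-voice) exceeds the threshold."""
    llr = gmm_voice.score_samples(frames) - gmm_nonvoice.score_samples(frames)
    return llr > threshold

# Usage: classify a batch of held-out frames, one boolean per frame.
test_frames = np.random.randn(10, 12)
print(is_voice(test_frames))
```

In practice the threshold trades off missed speech against false alarms, and smoothing the per-frame decisions (e.g., with hangover rules) is typically needed before segmenting the stream into sentences.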