Cascading appearance-based features for visual voice activity detection

Detecting voice activity is a challenging problem, especially when the level of acoustic noise is high. Most current approaches use only the audio signal, which makes them susceptible to acoustic noise. An obvious way to overcome this is to use the visual modality. The current state-of-the-art visual feature extraction technique uses a cascade of visual features (i.e. 2D-DCT, feature mean normalisation, interstep LDA). In this paper, we investigate the effectiveness of this technique for visual voice activity detection (VAD), analyse each stage of the cascade, and quantify the relative improvement in performance gained by each successive stage. The experiments were conducted on the CUAVE database, and our results highlight that the dynamics of the visual modality can be used to good effect to improve visual voice activity detection performance.
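
The three stages of the cascade named above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the 16x16 mouth region size, the retained 4x4 block of low-frequency coefficients, the window of two neighbouring frames on either side, the two-class Fisher form of LDA, and the synthetic data are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def dct2_features(rois, keep=4):
    """Stage 1: 2D-DCT of each square ROI, keeping the top-left
    keep x keep low-frequency block (an assumed selection rule)."""
    d = dct_matrix(rois.shape[1])
    coeffs = d @ rois @ d.T  # matmul broadcasts over the frame axis
    return coeffs[:, :keep, :keep].reshape(len(rois), -1)

def mean_normalise(feats):
    """Stage 2: subtract the per-sequence mean of every coefficient."""
    return feats - feats.mean(axis=0, keepdims=True)

def stack_frames(feats, j=2):
    """Concatenate each frame with j neighbours on either side
    (edge-padded) so that LDA can see temporal dynamics."""
    padded = np.pad(feats, ((j, j), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + len(feats)] for k in range(2 * j + 1)])

def fisher_lda(x, y, reg=1e-6):
    """Stage 3: two-class Fisher direction w = Sw^-1 (m1 - m0).
    The small ridge term reg is an assumption to keep Sw invertible."""
    x0, x1 = x[y == 0], x[y == 1]
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    sw = (x0 - m0).T @ (x0 - m0) + (x1 - m1).T @ (x1 - m1)
    sw += reg * np.eye(x.shape[1])
    return np.linalg.solve(sw, m1 - m0)

# Synthetic stand-in for mouth ROIs: 200 frames of 16x16 noise, where the
# second half ("speech") has an oscillating vertical intensity gradient.
rng = np.random.default_rng(0)
rois = rng.normal(size=(200, 16, 16))
labels = np.repeat([0, 1], 100)  # 0 = non-speech, 1 = speech
gradient = np.linspace(-1.0, 1.0, 16)[:, None] * np.ones((1, 16))
rois[100:] += np.sin(np.arange(100))[:, None, None] * gradient

static = mean_normalise(dct2_features(rois))  # shape (200, 16)
dynamic = stack_frames(static, j=2)           # shape (200, 80)
w = fisher_lda(dynamic, labels)
scores = dynamic @ w                          # one scalar score per frame
```

Thresholding `scores` would give a frame-level speech/non-speech decision. Dropping the `stack_frames` call and applying `fisher_lda` to `static` alone gives a rough analogue of evaluating the cascade without its dynamic stage, which is the kind of stage-by-stage comparison the abstract describes.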
