Robust detection of visual ROI for automatic speechreading

We present our work on visual pruning in an audio-visual (AV) speech recognition scenario. Visual speech information has been used successfully in circumstances where audio-only recognition suffers (e.g., noisy environments). Tracking and extraction of the region of interest (ROI), e.g., the speaker's mouth region, from video is an essential component of such systems. It is important for the visual front-end to handle tracking errors, which result in noisy visual data and hamper performance. We present our robust visual front-end and investigate methods to prune visual noise, along with their effect on the performance of AV speech recognition systems. Specifically, we estimate the "goodness of ROI" using Gaussian mixture models; our experiments indicate that significant performance gains are achieved with good-quality visual data.
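The abstract does not specify the features or implementation details behind the GMM-based "goodness of ROI" estimate, but the general idea can be sketched as follows: fit a Gaussian mixture to feature vectors extracted from well-tracked ROIs, then prune any incoming ROI whose log-likelihood under that model falls below a threshold. The feature choice (here a simulated 8-dimensional vector), the mixture size, and the percentile threshold are all illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical training data: feature vectors (e.g., transform coefficients)
# from well-tracked mouth ROIs, simulated here as a cluster around the origin.
good_feats = rng.normal(loc=0.0, scale=1.0, size=(500, 8))

# Fit a small GMM to model the distribution of "good" ROI features.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
gmm.fit(good_feats)

# Pick a pruning threshold, e.g., the 5th percentile of training log-likelihoods,
# so roughly 95% of known-good ROIs would be accepted.
threshold = np.percentile(gmm.score_samples(good_feats), 5)

def roi_is_good(feat):
    """Return True if the ROI feature vector scores above the GMM threshold."""
    return bool(gmm.score_samples(feat.reshape(1, -1))[0] >= threshold)

# A feature vector near the training distribution is kept; a far-off one
# (e.g., from a tracking failure) is pruned.
print(roi_is_good(np.zeros(8)))       # near the modeled distribution
print(roi_is_good(np.full(8, 10.0)))  # far outside it
```

In a real front-end, frames whose ROI fails this test would be excluded (or down-weighted) before the visual features reach the recognizer, which is the pruning effect the abstract credits for the performance gains.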
