Hough Transform-based Mouth Localization for Audio-visual Speech Recognition

We present a novel method for mouth localization in the context of multimodal speech recognition, where audio and visual cues are fused to improve speech recognition accuracy. Whereas facial feature points such as mouth corners or lip contours are commonly used to estimate at least the scale, position, and orientation of the mouth, we propose a Hough transform-based method. Instead of relying on a predefined sparse subset of mouth features, it casts probabilistic votes for the mouth center from several patches in the neighborhood and accumulates the votes in a Hough image. This makes the localization more robust, as it does not depend on the detection of any single feature. In addition, we exploit the different shape properties of the eyes and the mouth to localize the mouth more efficiently. Using the rotation-invariant representation of the iris, scale and orientation can be efficiently inferred from the localized eye positions. The superior accuracy of our method and quantitative improvements for audio-visual speech recognition over monomodal approaches are demonstrated on two datasets.
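The voting step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes that some patch-level predictor (not shown) has already produced, for each patch, a displacement to the mouth center and a vote probability. The function names `accumulate_votes`, `localize`, and `scale_and_orientation`, the `reference_distance` parameter, and all numeric values are hypothetical and chosen only for illustration.

```python
import math
import numpy as np


def accumulate_votes(shape, patch_centers, offsets, weights):
    """Accumulate weighted votes for the mouth center in a Hough image.

    shape:         (height, width) of the Hough image.
    patch_centers: (N, 2) sequence of (row, col) patch positions.
    offsets:       (N, 2) sequence of predicted (row, col) displacements
                   from each patch to the mouth center (assumed given).
    weights:       (N,) sequence of vote probabilities (assumed given).
    """
    hough = np.zeros(shape, dtype=np.float64)
    # Each patch votes for the location it believes is the mouth center.
    targets = np.rint(np.asarray(patch_centers) + np.asarray(offsets)).astype(int)
    for (r, c), w in zip(targets, weights):
        # Votes that fall outside the image are discarded.
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            hough[r, c] += w
    return hough


def localize(hough):
    """Return the (row, col) of the strongest accumulated vote."""
    r, c = np.unravel_index(np.argmax(hough), hough.shape)
    return int(r), int(c)


def scale_and_orientation(left_eye, right_eye, reference_distance=1.0):
    """Infer scale and in-plane rotation from two eye positions given as (x, y).

    Scale is the inter-ocular distance relative to a hypothetical reference
    distance; orientation is the angle of the eye axis in radians.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    scale = math.hypot(dx, dy) / reference_distance
    angle = math.atan2(dy, dx)
    return scale, angle


if __name__ == "__main__":
    # Three patches agree on (15, 15); one confident patch votes elsewhere.
    centers = [[10, 10], [12, 20], [20, 15], [5, 5]]
    offsets = [[5, 5], [3, -5], [-5, 0], [1, 1]]
    weights = [0.5, 0.3, 0.4, 0.9]
    hough = accumulate_votes((32, 32), centers, offsets, weights)
    print(localize(hough))  # prints (15, 15)
```

In this toy example the single most confident vote is an outlier, yet the accumulated votes of the three agreeing patches outweigh it, which is the robustness argument made above. A practical system would typically smooth the Hough image before taking the maximum.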
