Patch-Based Representation of Visual Speech

Visual information from a speaker's mouth region is known to improve the robustness of automatic speech recognition, especially in the presence of acoustic noise. To date, the vast majority of work in this field has treated these visual features holistically, which may fail to capture the varied local changes that occur during articulation (the process of changing the shape of the vocal tract using the articulators, i.e., the lips and jaw). Motivated by work in audio-visual automatic speech recognition (AVASR) using articulatory features (AFs) and in face recognition using patches, we present a proof-of-concept paper that represents the mouth region as an ensemble of image patches. Our experiments show that by treating the mouth region in this manner, we are able to extract more speech information from the visual domain. For the task of visual-only, speaker-independent, isolated-digit recognition, we achieved a relative word error rate improvement of more than 23% on the CUAVE audio-visual corpus.
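To make the patch-based representation concrete, the sketch below shows one plausible way to build it; it is an illustrative assumption rather than the paper's exact pipeline. It splits a grayscale mouth region of interest (ROI) into a fixed grid of non-overlapping patches and keeps a few low-order 2-D DCT coefficients per patch as its appearance feature. The grid size, coefficient count, and the function name extract_patch_features are all hypothetical choices.

import numpy as np
from scipy.fftpack import dct

def extract_patch_features(mouth_roi, grid=(3, 3), n_coeffs=10):
    """Split a grayscale mouth ROI into a grid of non-overlapping
    patches and return one compact DCT feature vector per patch."""
    h, w = mouth_roi.shape
    ph, pw = h // grid[0], w // grid[1]
    features = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            patch = mouth_roi[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            # 2-D DCT computed as two separable 1-D DCT-II transforms.
            coeffs = dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
            # Keep the first coefficients in row-major order as a compact
            # appearance descriptor (a zig-zag scan would order them more
            # strictly by spatial frequency).
            features.append(coeffs.flatten()[:n_coeffs])
    return features

# Example: a synthetic 48x72 mouth ROI split into a 3x3 patch ensemble,
# yielding nine 10-dimensional vectors, one per patch.
roi = np.random.rand(48, 72)
patch_vectors = extract_patch_features(roi)
print(len(patch_vectors), patch_vectors[0].shape)  # -> 9 (10,)

Each per-patch vector could then feed its own classifier stream, mirroring the ensemble treatment of the mouth region described above.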
