Multistream Articulatory Feature-Based Models for Visual Speech Recognition

We study the problem of automatic visual speech recognition (VSR) using dynamic Bayesian network (DBN)-based models consisting of multiple sequences of hidden states, each corresponding to an articulatory feature (AF) such as lip opening (LO) or lip rounding (LR). A bank of discriminative articulatory feature classifiers provides input to the DBN, in the form of either virtual evidence (VE), i.e., scaled likelihoods, or raw classifier margin outputs. We present experiments on two tasks: a medium-vocabulary word-ranking task and a small-vocabulary phrase recognition task. We show that articulatory feature-based models outperform baseline models, and we study several aspects of the models, such as the effects of allowing articulatory asynchrony, of using dictionary-based versus whole-word models, and of incorporating classifier outputs via virtual evidence versus alternative observation models.
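
As a rough illustration of the scaled-likelihood form of virtual evidence mentioned above, the sketch below divides hypothetical per-frame classifier posteriors for one articulatory feature by the class priors. By Bayes' rule, the result is proportional to the likelihood of the observation given each feature value. The function name, the three lip-opening values, and all numbers are assumptions made for illustration; they are not taken from the paper, which may compute its virtual evidence differently.

```python
def scaled_likelihoods(posteriors, priors):
    """Return P(c | x) / P(c) for each class c.

    By Bayes' rule, p(x | c) = P(c | x) * p(x) / P(c), so the ratio is
    proportional to the likelihood p(x | c), with p(x) the same for all c.
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    if any(q <= 0 for q in priors):
        raise ValueError("priors must be strictly positive")
    return [p / q for p, q in zip(posteriors, priors)]


# Hypothetical classifier posteriors for one video frame over three
# assumed lip-opening values (e.g. closed, narrow, wide).
posteriors = [0.7, 0.2, 0.1]
# Hypothetical class priors, e.g. relative frequencies in training data.
priors = [0.5, 0.25, 0.25]

virtual_evidence = scaled_likelihoods(posteriors, priors)
```

In a model of this kind, such per-frame scores would be attached to the corresponding hidden AF state in place of a generative observation likelihood; only the ratios between classes matter, because the common factor p(x) does not affect decoding.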
