Learning Fuzzy Rules for Visual Speech Recognition

We outline a method for learning fuzzy rules for visual speech recognition. Such a system could be used for automatic annotation of video sequences to aid subsequent retrieval, or to improve the recognition of voice commands on systems without a keyboard. In the implemented system, features were extracted automatically from short video sequences by identifying regions of the face and tracking the movement of various points around the mouth from frame to frame. The words in the video sequences were segmented manually at phoneme boundaries, and a rule base was constructed using two-dimensional fuzzy sets over feature and time parameters. The method was applied to the Tulips1 database, and the results, which were slightly better than those obtained with techniques based on neural networks and hidden Markov models, suggest that the learned rules are speaker independent. A medium-sized vocabulary of around 300 words, representative of the phonemes of English, was created and used for training and testing, and reasonable phoneme-classification accuracy was achieved. Because many speech sounds are ambiguous or visually similar, a scheme was developed to select a group of candidate words when a test word was presented to the system. The accuracy achieved was 21–33%, comparable to expert human lip-readers, whose accuracy on nonsense words is about 30%.
