Evolving connectionist method for adaptive audiovisual speech recognition

Reliability is the primary requirement for automatic speech recognition (ASR) in noisy conditions and for highly variable utterances. Integrating the recognition of visual signals with the recognition of audio signals is indispensable for many applications that require ASR in harsh conditions. Several important experiments have shown that integrating and adapting to multiple behavioral and contextual information sources during the speech-recognition task significantly improves its success rate. By integrating the audio and visual components of speech, an ASR system can resolve the most critical cases of phonetic-unit mismatch that arise when either modality is processed alone. The evolving fuzzy neural network (EFuNN) inference method is applied at the decision layer to accomplish this task, through a paradigm that adapts to the environment by changing its structure. The EFuNN's capacity to learn quickly from incoming data and to adapt online lowers the ASR system's complexity and enhances its performance in harsh conditions. Two independent feature extractors were developed, one for speech phonetics (listening to the speech) and the other for speech visemics (lip-reading the spoken input). The EFuNN was trained to fuse decisions made independently by the audio unit and the visual unit. Our experiments confirm that the proposed method is a reliable basis for developing a robust automatic speech-recognition system.
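The abstract does not give implementation details, so the following Python sketch is only illustrative of decision-level fusion with an EFuNN-style evolving rule layer. It assumes each modality's recognizer already outputs per-class scores in [0, 1]; the fused input is fuzzified with triangular membership functions, and a new rule node is grown whenever no existing node matches an example well enough or the output error is too large. All names (fuzzify, EFuNNFusion) and thresholds are hypothetical, and the mechanics follow Kasabov's EFuNN only in outline.

```python
# Minimal sketch (not the authors' implementation) of EFuNN-style
# decision fusion of audio and visual classifier scores.
import numpy as np

def fuzzify(x, n_mf=3):
    """Map each input value in [0, 1] onto triangular membership degrees."""
    centers = np.linspace(0.0, 1.0, n_mf)
    width = 1.0 / (n_mf - 1)
    mu = np.clip(1.0 - np.abs(x[:, None] - centers) / width, 0.0, 1.0)
    return mu.ravel()  # concatenated membership degrees

class EFuNNFusion:
    def __init__(self, n_classes, sensitivity=0.9, err_thr=0.1, lr=0.1):
        self.n_classes = n_classes
        self.sensitivity = sensitivity   # rule-node activation threshold (assumed value)
        self.err_thr = err_thr           # output-error threshold (assumed value)
        self.lr = lr                     # learning rate for updating the winning node
        self.W1 = []                     # fuzzy-input -> rule-node weights
        self.W2 = []                     # rule-node -> class weights

    def _activations(self, f):
        # Fuzzy similarity (1 - mean absolute distance) to each rule node.
        return np.array([1.0 - np.abs(f - w).mean() for w in self.W1])

    def learn_one(self, audio_scores, visual_scores, label):
        """One-pass online learning from a single fused example."""
        f = fuzzify(np.concatenate([audio_scores, visual_scores]))
        target = np.eye(self.n_classes)[label]
        if not self.W1:
            self.W1.append(f.copy()); self.W2.append(target.copy())
            return
        act = self._activations(f)
        j = int(np.argmax(act))
        out = self.W2[j] * act[j]
        # Grow a new rule node if no node matches well enough or the output
        # error is too large; otherwise refine the winning node.
        if act[j] < self.sensitivity or np.abs(target - out).max() > self.err_thr:
            self.W1.append(f.copy()); self.W2.append(target.copy())
        else:
            self.W1[j] += self.lr * act[j] * (f - self.W1[j])
            self.W2[j] += self.lr * act[j] * (target - out)

    def predict(self, audio_scores, visual_scores):
        """Fuse the two score vectors and return the winning class index."""
        if not self.W1:
            raise RuntimeError("no rule nodes yet; call learn_one first")
        f = fuzzify(np.concatenate([audio_scores, visual_scores]))
        j = int(np.argmax(self._activations(f)))
        return int(np.argmax(self.W2[j]))
```

Growing rule nodes one example at a time is what gives the EFuNN its one-pass, online adaptability; the full model also aggregates and prunes rule nodes to bound structural growth, which this sketch omits.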
