Audio-visual keyword spotting based on adaptive decision fusion under noisy conditions for human-robot interaction

Keyword spotting (KWS) is the identification of keywords in unconstrained speech, and it offers a natural, straightforward and friendly channel for human-robot interaction (HRI). Most keyword spotters lack noise robustness when applied in real-world environments where the noise changes dramatically. Because visual information is unaffected by acoustic noise, it can complement the audio stream and improve noise robustness. In this paper, a novel audio-visual keyword spotting approach based on adaptive decision fusion under noisy conditions is proposed. To represent the appearance and movement of the mouth region accurately, an improved local binary pattern from three orthogonal planes (ILBP-TOP) is proposed. In addition, a parallel two-step recognition is performed on the acoustic and visual keyword candidates, producing an acoustic score and a visual score for each candidate. Optimal weights for combining the acoustic and visual contributions under diverse noise conditions are generated by a neural network from the reliabilities of the two modalities. Experiments show that the proposed decision-fusion-based audio-visual keyword spotter significantly improves noise robustness and outperforms a feature-fusion-based audio-visual spotter. Additionally, ILBP-TOP compares favorably with LBP-TOP.
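
The decision-fusion step described above can be illustrated with a minimal sketch. The abstract does not give the fusion formula or the network architecture, so everything here is an assumption: the weighted-sum form, the function names `fuse_scores` and `weight_from_reliability`, and the logistic mapping that stands in for the paper's neural network are all hypothetical.

```python
import math


def fuse_scores(acoustic, visual, audio_weight):
    """Weighted-sum decision fusion of per-candidate scores (assumed form).

    acoustic, visual: dicts mapping keyword candidate -> score.
    audio_weight: weight in [0, 1] given to the acoustic score;
                  the visual score receives 1 - audio_weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    # Only candidates scored by both modalities can be fused.
    shared = acoustic.keys() & visual.keys()
    return {
        kw: audio_weight * acoustic[kw] + (1.0 - audio_weight) * visual[kw]
        for kw in shared
    }


def weight_from_reliability(audio_rel, visual_rel, slope=1.0):
    """Hypothetical stand-in for the paper's neural network.

    Maps the difference of two reliability values to a weight in (0, 1)
    with a logistic function; the real mapping is learned, not fixed.
    """
    return 1.0 / (1.0 + math.exp(-slope * (audio_rel - visual_rel)))


# Made-up scores for two keyword candidates.
acoustic_scores = {"hello": -0.2, "stop": -0.9}
visual_scores = {"hello": -0.8, "stop": -0.3}

# Reliable audio: the acoustic stream dominates the decision.
clean = fuse_scores(acoustic_scores, visual_scores, audio_weight=0.9)
# Unreliable audio: the visual stream dominates the decision.
noisy = fuse_scores(acoustic_scores, visual_scores, audio_weight=0.1)

best_clean = max(clean, key=clean.get)
best_noisy = max(noisy, key=noisy.get)
```

With these illustrative numbers the selected keyword changes as the weight shifts from the acoustic to the visual stream, which is the behavior an adaptive weight is meant to provide when acoustic noise rises.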
