Spotting Visual Keywords from Temporal Sliding Windows

Visual Keyword Spotting (KWS), a recently proposed task derived from visual speech recognition, still leaves plenty of room for improvement. This paper details the Visual Keyword Spotting system we used in the first Mandarin Audio-Visual Speech Recognition Challenge (MAVSR 2019). Under the assumption that the vocabulary of the target dataset is a subset of the training-set vocabulary, we propose a simple and scalable classification-based strategy that achieves 19.0% mean average precision (mAP) on this challenge. Our method uses temporal sliding windows to bridge the word-level training dataset and the sentence-level target dataset, showing that a strong word-level classifier can be used directly to build sentence-level embeddings, and hence a complete KWS system.
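
To make the pipeline concrete, below is a minimal sketch of the sliding-window scoring step, assuming a pre-trained word-level classifier that maps a fixed-length clip of mouth-region frames to a probability distribution over the word vocabulary. The function name keyword_scores, the window length, the stride, and the max-pooling aggregation over windows are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def keyword_scores(frames, classifier, window_len=40, stride=5):
    """Score every vocabulary word against one sentence-level video.

    frames:      (T, H, W) array of mouth-region frames, assumed T >= window_len
    classifier:  callable mapping a (window_len, H, W) clip to a
                 (num_words,) probability vector over the word vocabulary
    returns:     (num_words,) sentence-level score for each vocabulary word
    """
    T = frames.shape[0]
    window_probs = []
    # Slide a fixed-length temporal window over the sentence and apply
    # the word-level classifier to each window independently.
    for start in range(0, T - window_len + 1, stride):
        clip = frames[start:start + window_len]
        window_probs.append(classifier(clip))
    # Max-pool over windows: a keyword counts as present if any single
    # window fires strongly on it (an illustrative aggregation choice).
    return np.max(np.stack(window_probs), axis=0)
```

Ranking the sentence-level videos by the score assigned to a query keyword then yields the retrieval list on which mAP is computed.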
