Spotting Visual Keywords from Temporal Sliding Windows

Visual Keyword Spotting (KWS), a recently proposed task derived from visual speech recognition, still leaves plenty of room for improvement. This paper details the Visual Keyword Spotting system we used in the first Mandarin Audio-Visual Speech Recognition Challenge (MAVSR 2019). Under the assumption that the vocabulary of the target dataset is a subset of the training-set vocabulary, we propose a simple and scalable classification-based strategy that achieves 19.0% mean average precision (mAP) on this challenge. Our method uses temporal sliding windows to bridge the word-level training dataset and the sentence-level target dataset, showing that a strong word-level classifier can be used directly to build sentence-level embeddings, and hence a complete KWS system.
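
To make the pipeline concrete, below is a minimal sketch of the sliding-window scoring step, assuming a pre-trained word-level classifier that maps a fixed-length clip of mouth-region frames to a probability distribution over the word vocabulary. The function name keyword_scores, the window length, the stride, and the max-pooling aggregation over windows are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def keyword_scores(frames, classifier, window_len=40, stride=5):
    """Score every vocabulary word against one sentence-level video.

    frames:      (T, H, W) array of mouth-region frames, assumed T >= window_len
    classifier:  callable mapping a (window_len, H, W) clip to a
                 (num_words,) probability vector over the word vocabulary
    returns:     (num_words,) sentence-level score for each vocabulary word
    """
    T = frames.shape[0]
    window_probs = []
    # Slide a fixed-length temporal window over the sentence and apply
    # the word-level classifier to each window independently.
    for start in range(0, T - window_len + 1, stride):
        clip = frames[start:start + window_len]
        window_probs.append(classifier(clip))
    # Max-pool over windows: a keyword counts as present if any single
    # window fires strongly on it (an illustrative aggregation choice).
    return np.max(np.stack(window_probs), axis=0)
```

Ranking the sentence-level videos by the score assigned to a query keyword then yields the retrieval list on which mAP is computed.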
