Seeing wake words: Audio-visual Keyword Spotting

The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio. We propose a zero-shot method suitable for in-the-wild videos. Our key contributions are: (1) a novel convolutional architecture, KWS-Net, that uses a similarity map intermediate representation to separate the task into (i) sequence matching and (ii) pattern detection, to decide whether the word is there and when; (2) we demonstrate that, when audio is available, visual keyword spotting improves performance for both clean and noisy audio signals. Finally, (3) we show that our method generalises to other languages, specifically French and German, and achieves performance comparable to English with less language-specific data, by fine-tuning the network pre-trained on English. The method exceeds the performance of the previous state-of-the-art visual keyword spotting architecture when trained and tested on the same benchmark, and also that of a state-of-the-art lip reading method.
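
To make the similarity-map idea concrete, here is a minimal sketch of the two-stage design the abstract describes: a query keyword is encoded as a sequence of embeddings, correlated against per-frame visual features to form a 2-D similarity map (sequence matching), and a small CNN then scans that map for the quasi-diagonal trace a spoken match leaves (pattern detection). This is not the paper's exact KWS-Net implementation; all module names, sizes, the phoneme-based keyword encoding, and the use of cosine similarity are illustrative assumptions.

```python
# Sketch of a similarity-map keyword spotter (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityMapKWS(nn.Module):
    def __init__(self, feat_dim=512, num_phonemes=40, emb_dim=512):
        super().__init__()
        # Sequence matching branch: embed the keyword's phoneme sequence.
        self.phoneme_emb = nn.Embedding(num_phonemes, emb_dim)
        self.keyword_enc = nn.LSTM(emb_dim, feat_dim // 2, batch_first=True,
                                   bidirectional=True)
        # Pattern detection branch: a small CNN that looks for the
        # quasi-diagonal match pattern in the similarity map.
        self.detector = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool2d((1, None)),  # collapse keyword axis, keep time
        )
        self.classifier = nn.Conv1d(64, 1, 1)  # per-frame keyword score

    def forward(self, visual_feats, keyword_phonemes):
        # visual_feats: (B, T, D) per-frame lip features from a video backbone
        # keyword_phonemes: (B, K) phoneme indices of the query keyword
        q, _ = self.keyword_enc(self.phoneme_emb(keyword_phonemes))  # (B, K, D)
        # Cosine-similarity map of shape (B, K, T): entry (k, t) measures how
        # well phoneme k of the keyword matches video frame t.
        sim = torch.einsum('bkd,btd->bkt',
                           F.normalize(q, dim=-1),
                           F.normalize(visual_feats, dim=-1))
        feat = self.detector(sim.unsqueeze(1))     # (B, 64, 1, T)
        scores = self.classifier(feat.squeeze(2))  # (B, 1, T)
        return scores.squeeze(1)                   # per-frame logits over time
```

In this sketch, the maximum of the per-frame scores (after a sigmoid) answers "is the word there", and its argmax answers "when"; because the keyword enters only through its phoneme embeddings, unseen words can be queried zero-shot.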