Now You're Speaking My Language: Visual Language Identification

The goal of this work is to train models that can identify a spoken language just by interpreting the speaker’s lip movements. Our contributions are the following: (i) we show that models can learn to discriminate among 14 different languages using only visual speech information; (ii) we compare different designs in sequence modelling and utterance-level aggregation in order to determine the best architecture for this task; (iii) we investigate the factors that contribute discriminative cues and show that our model indeed solves the problem by finding temporal patterns in mouth movements and not by exploiting spurious correlations. We demonstrate this further by evaluating our models on challenging examples from bilingual speakers.
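
Contribution (ii) contrasts ways of collapsing a variable-length sequence of per-frame features into a single utterance-level vector before classifying over the 14 languages. The abstract does not specify the architectures compared, so the following is only a minimal sketch of two generic aggregation choices (mean pooling and a simple attention pooling), not the authors' models. The function names, feature dimension, and random weights are all hypothetical and chosen purely for illustration.

```python
import math
import random

NUM_LANGUAGES = 14  # taken from the abstract; everything else is an assumption


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames):
    """Average per-frame feature vectors into one utterance-level vector."""
    n, dim = len(frames), len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def attention_pool(frames, query):
    """Weight each frame by softmax(frame . query), then take the weighted sum."""
    scores = [sum(x * q for x, q in zip(f, query)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def classify(utterance_vec, class_weights):
    """Linear layer followed by softmax over the language classes."""
    logits = [sum(x * w for x, w in zip(utterance_vec, row)) for row in class_weights]
    return softmax(logits)


# Usage with placeholder data: 25 frames of 8-dimensional features (hypothetical sizes).
random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(8)] for _ in range(25)]
query = [random.gauss(0, 1) for _ in range(8)]
class_weights = [[random.gauss(0, 1) for _ in range(8)] for _ in range(NUM_LANGUAGES)]

probs_mean = classify(mean_pool(frames), class_weights)
probs_attn = classify(attention_pool(frames, query), class_weights)
```

Mean pooling treats every frame equally, whereas attention pooling lets some frames contribute more than others to the utterance-level vector. With an all-zero query the attention weights become uniform and the two functions coincide.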
