Personalized One-Shot Lipreading for an ALS Patient

Lipreading, or visually recognizing speech from the mouth movements of a speaker, is a challenging and mentally taxing task. Unfortunately, multiple medical conditions force people to depend on this skill in their day-to-day lives for essential communication. Patients suffering from Amyotrophic Lateral Sclerosis (ALS) often lose muscle control and, consequently, their ability to generate speech and communicate via lip movements. Existing large datasets do not focus on medical patients or curate personalized vocabulary relevant to an individual. Collecting a large-scale dataset from a patient, as needed to train modern data-hungry deep learning models, is, however, extremely challenging. In this work, we propose a personalized network to lipread an ALS patient using only one-shot examples. We depend on synthetically generated lip movements to augment the one-shot scenario. A Variational Encoder based domain adaptation technique is used to bridge the real-synthetic domain gap. For the patient, our approach achieves a top-5 accuracy of 83.2%, compared to 62.6% achieved by comparable methods. Apart from evaluating our approach on the ALS patient, we also extend it to people with hearing impairment who rely extensively on lip movements to communicate.
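
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how top-5 accuracy is commonly computed for a word-level classifier: a prediction counts as correct when the true word is among the five highest-scoring classes. The function name `top_k_accuracy`, the six-word vocabulary, and the scores are hypothetical illustrations and are not taken from the paper's implementation or data.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 5,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical toy data: 3 clips, 6-word vocabulary (not from the paper).
toy_scores = [
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],
    [0.00, 0.10, 0.20, 0.30, 0.25, 0.15],
]
toy_labels = [4, 0, 5]

top5 = top_k_accuracy(toy_scores, toy_labels, k=5)
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
```

In a small personalized vocabulary, top-5 accuracy is more forgiving than top-1 accuracy, which may suit an assistive setting where a user can pick from a short list of candidate words.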
