Understanding Pictograph with Facial Features: End-to-End Sentence-Level Lip Reading of Chinese

Driven by breakthroughs in deep learning, lip reading technology has been advancing extraordinarily rapidly. Chinese is the most widely spoken language in the world, yet unlike alphabetic languages it involves more than 1,000 pronunciations (Pinyin) and nearly 90,000 pictographic characters (Hanzi), which makes Chinese lip reading particularly challenging. In this paper, we implement visual-only Chinese lip reading of unconstrained sentences with a two-step end-to-end architecture (LipCH-Net), in which two deep neural network models perform Picture-to-Pinyin recognition (mouth-motion pictures to pronunciations) and Pinyin-to-Hanzi recognition (pronunciations to text), respectively, followed by joint optimization to improve overall performance. In addition, two modules of the Pinyin-to-Hanzi model are pre-trained separately on large auxiliary data before sequence-to-sequence training, so as to exploit long sequence matches and avoid ambiguity. We collect six months of daily news broadcasts from the China Central Television (CCTV) website and semi-automatically label them into a 20.95 GB dataset of 20,495 natural Chinese sentences. When trained on this CCTV dataset, LipCH-Net outperforms all state-of-the-art lip reading frameworks. The results show that our scheme not only accelerates training and reduces overfitting, but also overcomes the syntactic ambiguity of Chinese, providing a baseline for future work.
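To make the two-step design concrete, below is a minimal sketch in PyTorch of how such a pipeline could be wired up. This is not the authors' released code: the layer sizes, vocabulary sizes, the CTC-trained front-end for the Picture-to-Pinyin stage, and the GRU encoder-decoder for the Pinyin-to-Hanzi stage are all illustrative assumptions consistent with the abstract's description.

```python
# A minimal sketch (assumed PyTorch implementation, not the authors' code) of the
# two-step LipCH-Net pipeline described above: a Picture-to-Pinyin model mapping
# mouth-motion frames to per-frame Pinyin logits, and a Pinyin-to-Hanzi
# sequence-to-sequence model. All sizes and module choices are assumptions.
import torch
import torch.nn as nn

NUM_PINYIN = 1300   # assumed: > 1,000 Pinyin pronunciations (+ CTC blank at 0)
NUM_HANZI = 6000    # assumed: working Hanzi vocabulary, not the full ~90,000

class PictureToPinyin(nn.Module):
    """Mouth-motion frames -> per-frame Pinyin logits (trainable with CTC)."""
    def __init__(self, hidden=256):
        super().__init__()
        # Spatiotemporal front-end over (batch, 1, time, 64, 64) lip crops.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 4, 4)),
        )
        self.rnn = nn.GRU(64 * 8 * 8, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, NUM_PINYIN)

    def forward(self, frames):                       # frames: (B, 1, T, 64, 64)
        x = self.frontend(frames)                    # (B, 64, T, 8, 8)
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        x, _ = self.rnn(x)                           # (B, T, 2*hidden)
        return self.fc(x)                            # (B, T, NUM_PINYIN)

class PinyinToHanzi(nn.Module):
    """Pinyin token sequence -> Hanzi logits; the encoder and decoder are the
    two modules that could be pre-trained separately on large text corpora."""
    def __init__(self, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(NUM_PINYIN, hidden)
        self.tgt_emb = nn.Embedding(NUM_HANZI, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, NUM_HANZI)

    def forward(self, pinyin_ids, hanzi_ids):        # teacher-forced decoding
        _, state = self.encoder(self.src_emb(pinyin_ids))
        out, _ = self.decoder(self.tgt_emb(hanzi_ids), state)
        return self.fc(out)                          # (B, T_tgt, NUM_HANZI)

if __name__ == "__main__":
    p2p, p2h = PictureToPinyin(), PinyinToHanzi()
    frames = torch.randn(2, 1, 75, 64, 64)           # 2 clips of 75 lip frames
    pinyin_logits = p2p(frames)                      # (2, 75, NUM_PINYIN)
    hanzi_logits = p2h(torch.randint(0, NUM_PINYIN, (2, 20)),
                       torch.randint(0, NUM_HANZI, (2, 20)))
    print(pinyin_logits.shape, hanzi_logits.shape)
```

In this layout, the first stage would be trained with a CTC loss (`nn.CTCLoss`) over the per-frame Pinyin logits and the second with cross-entropy over Hanzi tokens, before fine-tuning the cascade jointly, matching the two-step-then-joint scheme the abstract describes.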
