Cross-Domain Deep Visual Feature Generation for Mandarin Audio–Visual Speech Recognition

There has been long-term interest in using visual information to improve the performance of automatic speech recognition (ASR) systems. Conventional audio-visual speech recognition (AVSR) systems require both audio and visual inputs, which limits their wider application when the visual modality is not present. One possible solution is to use acoustic-to-visual (A2V) inversion techniques to generate the visual features. Previous research in this direction trained inversion models on synthetic acoustic-articulatory parallel data and did not consider the acoustic mismatch between the audio-visual (AV) parallel data and the target data. In addition, the target language for these techniques has largely been English. In this article, a real 3D Audio-Visual Mandarin Continuous Speech (3DAV-MCS) corpus was used to train deep neural network based A2V inversion models. Cross-domain adaptation of the inversion models allows suitable visual features to be generated from acoustic data of mismatched domains. The proposed cross-domain deep visual feature generation techniques were evaluated on two state-of-the-art Mandarin speech recognition tasks: DARPA GALE broadcast transcription and BOLT conversational telephone speech recognition. The AVSR systems built with the cross-domain generated visual features consistently outperformed the baseline convolutional neural network (CNN) ASR systems by up to 3.3% absolute (9.1% relative) character error rate (CER) reduction after both speaker adaptive training and sequence discriminative training were performed.
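
To make the A2V inversion idea concrete, below is a minimal sketch, not the authors' implementation, of a DNN that maps a context window of acoustic frames to a visual feature vector, trained on AV-parallel data and then used to generate visual features from audio-only target-domain data. All dimensions, layer sizes, and names (e.g., A2VInverter, train_inversion, generate_visual_features) are illustrative assumptions; the paper's cross-domain adaptation step, which compensates for the acoustic mismatch between the AV-parallel data and the target data, is omitted here.

```python
# Minimal sketch of DNN-based acoustic-to-visual (A2V) inversion.
# All hyper-parameters and names are illustrative assumptions, not
# the paper's exact setup.
import torch
import torch.nn as nn

class A2VInverter(nn.Module):
    """Maps a context window of acoustic frames to one visual feature vector."""
    def __init__(self, acoustic_dim=40, context=11, visual_dim=30, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim * context, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            # e.g., a compressed representation of 3D lip/face points
            nn.Linear(hidden, visual_dim),
        )

    def forward(self, x):  # x: (batch, acoustic_dim * context)
        return self.net(x)

def train_inversion(model, av_loader, epochs=10, lr=1e-3):
    """Supervised training on AV-parallel pairs (acoustic window, visual target)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for acoustic, visual in av_loader:
            opt.zero_grad()
            loss = loss_fn(model(acoustic), visual)
            loss.backward()
            opt.step()
    return model

@torch.no_grad()
def generate_visual_features(model, acoustic_windows):
    """Generate visual features from audio-only target-domain data for AVSR."""
    model.eval()
    return model(acoustic_windows)
```

In an AVSR pipeline of this kind, the generated visual features would typically be concatenated with the acoustic features at the input of the recognition system, so that audio-only target data can still benefit from a visual stream.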
