Fusion Architectures for Word-Based Audiovisual Speech Recognition

In this study we investigate architectures for modality fusion in audiovisual speech recognition, where the goal is to alleviate the adverse effect of acoustic noise on recognition accuracy by using video images of the speaker’s face as an additional modality. Starting from an established neural-network fusion system, we substantially improve recognition accuracy by taking single-modality losses into account: late fusion (at the level of the output logits) is considerably more robust than the baseline, in particular under unseen acoustic noise, at the expense of having to determine the optimal weighting of the input streams. This latter requirement can be removed by making the fusion itself a trainable part of the network.
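
As a rough illustration of the approach described above, the following sketch shows logit-level late fusion with a trainable stream weight and auxiliary single-modality loss terms. This is not the authors' implementation: PyTorch is assumed, the module and parameter names are hypothetical, and the sigmoid parameterization of the fusion weight and the auxiliary loss weight `aux_weight` are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateFusion(nn.Module):
    """Logit-level late fusion of an audio and a video word classifier,
    with a single trainable fusion weight (a sketch, not the paper's code)."""

    def __init__(self, audio_net: nn.Module, video_net: nn.Module):
        super().__init__()
        self.audio_net = audio_net  # maps audio features to per-word logits
        self.video_net = video_net  # maps video frames to per-word logits
        self.alpha = nn.Parameter(torch.zeros(1))  # trainable fusion weight

    def forward(self, audio, video):
        logits_a = self.audio_net(audio)
        logits_v = self.video_net(video)
        w = torch.sigmoid(self.alpha)  # constrain the stream weight to (0, 1)
        fused = w * logits_a + (1.0 - w) * logits_v
        return fused, logits_a, logits_v

def fusion_loss(fused, logits_a, logits_v, targets, aux_weight=0.5):
    # Cross-entropy on the fused logits, plus single-modality losses so that
    # each stream remains discriminative on its own, as the abstract suggests.
    loss = F.cross_entropy(fused, targets)
    loss = loss + aux_weight * F.cross_entropy(logits_a, targets)
    loss = loss + aux_weight * F.cross_entropy(logits_v, targets)
    return loss
```

Because the fusion weight is an ordinary network parameter, it is learned jointly with the two streams by gradient descent, which removes the need to tune the stream weighting by hand.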
