Speaker Adaptive Audio-Visual Fusion for the Open-Vocabulary Section of AVICAR

This experimental study establishes the first audio-visual speech recognition baseline for the TIMIT sentence portion of AVICAR, a dataset recorded in a real, noisy car environment. We use an automatic speech recognizer trained on a larger corpus to generate an audio-only recognition baseline for AVICAR, and we use forced alignments of the AVICAR audio to obtain training targets for the convolutional neural network based visual front end. Motivated by the observation that visual features vary greatly across speakers, we apply feature-space maximum likelihood linear regression (fMLLR) based speaker adaptation to the visual features. We find that the quality of the fMLLR transforms is sensitive to the quality of the alignment probabilities used to compute them; our experiments compare fMLLR transforms estimated from audio-visual versus audio-only alignment probabilities. We report the first audio-visual results for the TIMIT subset of AVICAR and show that the word error rate of the proposed audio-visual system is significantly lower than that of the audio-only system.
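For context, fMLLR (also known as constrained MLLR) adapts the features rather than the model parameters: each speaker s receives an affine transform applied to every feature vector. A minimal sketch of the per-speaker transform, with o_t denoting the feature vector at frame t:

\[
\hat{\mathbf{o}}_t^{(s)} = \mathbf{A}^{(s)} \mathbf{o}_t + \mathbf{b}^{(s)}
\]

The parameters (A^{(s)}, b^{(s)}) are estimated to maximize the likelihood of the speaker's transformed features under the shared model (including a log|det A^{(s)}| Jacobian term), using per-frame state occupancy probabilities obtained from an alignment. This dependence on the alignment is why the abstract contrasts transforms computed from audio-visual versus audio-only alignment probabilities.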
