Scope for Deep Learning: A Study in Audio-Visual Speech Recognition

Audio-visual signal processing has emerged as an important research area within multimodal signal processing. Among its applications, speech recognition from audio-visual signals has gained particular attention because efficient acoustic noise removal methods are not available. This paper presents a literature review of audio-visual speech recognition systems. The different types of audio and visual feature extraction mechanisms are discussed, along with the various classification models used in such systems.
