Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications

Speech is a widely used interaction-recognition technique in edutainment systems and a key technology for smooth educational learning and user–system interaction. However, its application is limited in real environments owing to various noise disruptions. In this study, a multimodal interaction system based on audio and visual information is proposed that makes speech-driven virtual aquarium systems robust to ambient noise. For audio-based speech recognition, the list of words recognized by a speech API is expressed as word vectors using a pretrained model, while vision-based speech recognition uses a composite end-to-end deep neural network. The vectors derived from the API and from vision are then concatenated and classified. The signal-to-noise ratio of the proposed system was determined using data from four types of noise environments, and its accuracy and efficiency were compared against existing single-mode strategies for visual feature extraction and audio speech recognition. The average recognition rate was 91.42% when only speech was used and improved by 6.7 percentage points to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly used, such as cafés, museums, music halls, and kiosks.
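The fusion step the abstract describes (embedding the speech API's recognized word as a word vector, extracting a visual feature vector from a lip-reading network, concatenating the two, and classifying the result) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the dimensions (a 300-d word embedding, a 256-d visual feature, a 50-word command vocabulary), the layer sizes, and the `FusionClassifier` name are all assumptions made for the example.

```python
# Minimal sketch of late fusion by concatenation, in PyTorch.
# All dimensions and module structure are illustrative assumptions,
# not the architecture reported in the paper.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates an audio-side word vector with a visual feature
    vector and classifies the result over a fixed command vocabulary."""
    def __init__(self, audio_dim=300, visual_dim=256, num_words=50):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 128),  # fused vector -> hidden
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_words),               # hidden -> word scores
        )

    def forward(self, audio_vec, visual_vec):
        # Late fusion: concatenate the two modality vectors, then classify.
        fused = torch.cat([audio_vec, visual_vec], dim=-1)
        return self.classifier(fused)

# Usage with dummy inputs standing in for a word2vec-style embedding of the
# API's recognized word and a feature vector from the lip-reading network.
model = FusionClassifier()
audio_vec = torch.randn(1, 300)
visual_vec = torch.randn(1, 256)
logits = model(audio_vec, visual_vec)
predicted_word = logits.argmax(dim=-1)
```

Under this reading, the audio branch contributes a semantic representation of what the API heard, the visual branch contributes articulation cues that survive acoustic noise, and the classifier learns to weigh the two, which is consistent with the reported gain under noisy conditions.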
