Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss

In this paper, we propose an audio-visual speech recognition system for a person with an articulation disorder resulting from severe hearing loss. In the case of a person with this type of articulation disorder, the speech style is quite different from those of people without hearing loss that a speaker-independent acoustic model for unimpaired persons is hardly useful for recognizing it. The audio-visual speech recognition system we present in this paper is for a person with severe hearing loss in noisy environments. Although feature integration is an important factor in multimodal speech recognition, it is difficult to integrate efficiently because those features are different intrinsically. We propose a novel visual feature extraction approach that connects the lip image to audio features efficiently, and the use of convolutive bottleneck networks (CBNs) increases robustness with respect to speech fluctuations caused by hearing loss. The effectiveness of this approach was confirmed through word-recognition experiments in noisy environments, where the CBN-based feature extraction method outperformed the conventional methods.

[1]  Tetsuya Takiguchi,et al.  Integration of Metamodel and Acoustic Model for Dysarthric Speech Recognition , 2009, J. Multim..

[2]  Timothy F. Cootes,et al.  Feature Detection and Tracking with Constrained Local Models , 2006, BMVC.

[3]  Nobuo Ezaki,et al.  Text detection from natural scene images: towards a system for visually impaired persons , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[4]  G. Montavon Deep learning for spoken language identification , 2009 .

[5]  Tetsuya Takiguchi,et al.  Dysarthric speech recognition using a convolutive bottleneck network , 2014, 2014 12th International Conference on Signal Processing (ICSP).

[6]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[7]  Ashish Verma,et al.  LATE INTEGRATION IN AUDIO-VISUAL CONTINUOUS SPEECH RECOGNITION , 1999 .

[8]  Yann LeCun,et al.  Learning long‐range vision for autonomous off‐road driving , 2009, J. Field Robotics.

[9]  Vaibhava Goel,et al.  Deep multimodal learning for Audio-Visual Speech Recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[11]  Martin Karafiát,et al.  Convolutive Bottleneck Network features for LVCSR , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[12]  Tetsuya Takiguchi,et al.  Multimodal speech recognition of a person with articulation disorders using AAM and MAF , 2010, 2010 IEEE International Workshop on Multimedia Signal Processing.

[13]  Tetsuya Takiguchi,et al.  Audio-Visual Speech Recognition Using Convolutive Bottleneck Networks for a Person with Severe Hearing Loss , 2015, IPSJ Trans. Comput. Vis. Appl..

[14]  Satoshi Tamura,et al.  Integration of deep bottleneck features for audio-visual speech recognition , 2015, INTERSPEECH.

[15]  Martin J. Russell,et al.  Integrating audio and visual information to provide highly robust speech recognition , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[16]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[17]  Gerasimos Potamianos,et al.  Discriminative training of HMM stream exponents for audio-visual speech recognition , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[18]  Christophe Garcia,et al.  text Detection with Convolutional Neural Networks , 2008, VISAPP.

[19]  Ying Wu,et al.  Capturing human hand motion in image sequences , 2002, Workshop on Motion and Video Computing, 2002. Proceedings..

[20]  Christophe Garcia,et al.  Convolutional face finder: a neural architecture for fast and robust face detection , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Tetsuya Takiguchi,et al.  Local-feature-map Integration Using Convolutional Neural Networks for Music Genre Classification , 2012, INTERSPEECH.