An audio-visual fusion framework with joint dimensionality reduction

By combining the audio and visual modalities, speech recognition systems achieve higher performance and robustness. Existing fusion strategies fall mainly into three types: feature-level fusion, model-level fusion, and decision-level fusion. In this paper, we present a novel audio-visual fusion framework in which a joint dimensionality reduction approach projects the audio and visual features into more compact subspaces. Under a correlation-preserving criterion, the projected audio and visual representations retain the correlation conveyed in the original audio and visual feature spaces. At the same time, the more compact feature spaces yield better model efficiency. Experiments on audio-visual person verification demonstrate the efficiency and effectiveness of the proposed fusion framework.
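As a rough illustration of what a correlation-preserving joint projection can look like, the sketch below uses classical canonical correlation analysis (CCA) in NumPy. The abstract does not specify the exact criterion, so CCA is an assumption here, and the function name `cca_project`, the regularization term `reg`, and the feature dimensions in the usage note are hypothetical choices made only for this example.

```python
import numpy as np

def cca_project(audio, visual, k, reg=1e-6):
    """Project two feature sets onto k maximally correlated directions.

    Classical CCA, used here as an assumed stand-in for a
    correlation-preserving joint dimensionality reduction.
    audio:  (n_samples, d_a) array, visual: (n_samples, d_v) array.
    """
    # Center each modality.
    a = audio - audio.mean(axis=0)
    v = visual - visual.mean(axis=0)
    n = a.shape[0]

    # Covariance estimates; reg keeps them positive definite (assumption).
    caa = a.T @ a / (n - 1) + reg * np.eye(a.shape[1])
    cvv = v.T @ v / (n - 1) + reg * np.eye(v.shape[1])
    cav = a.T @ v / (n - 1)

    # Whiten both modalities via Cholesky factors: T = La^{-1} Cav Lv^{-T}.
    la = np.linalg.cholesky(caa)
    lv = np.linalg.cholesky(cvv)
    t = np.linalg.solve(la, cav)
    t = np.linalg.solve(lv, t.T).T

    # Singular values of T are the canonical correlations.
    u, s, vt = np.linalg.svd(t, full_matrices=False)

    # Map the singular vectors back to the original feature spaces.
    wa = np.linalg.solve(la.T, u[:, :k])
    wv = np.linalg.solve(lv.T, vt.T[:, :k])
    return a @ wa, v @ wv, s[:k]
```

For example, calling `cca_project(audio, visual, k=2)` on a 12-dimensional audio feature matrix and an 8-dimensional visual feature matrix with the same number of rows returns two `(n_samples, 2)` arrays plus the two leading canonical correlations. The compact projections could then feed any of the fusion schemes named above.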
