论文信息 - Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction:

Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction:

Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi-stream Dynamic Bayesian Network and coupled HMM are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN) to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and build an accurate audio-visual speech recognition model under a no frame-independency assumption. The experiment results on Tibetan speech data from some real-world environments showed the proposed DDBN outperforms the state-of-art methods in word recognition accuracy.

[1] Stuart J. Russell,et al. Dynamic bayesian networks: representation, inference and learning , 2002 .

[2] Qiang Ji,et al. Efficient Structure Learning of Bayesian Networks using Constraints , 2011, J. Mach. Learn. Res..

[3] Geoffrey E. Hinton,et al. Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine , 2010, NIPS.

[4] Honglak Lee,et al. Unsupervised feature learning for audio classification using convolutional deep belief networks , 2009, NIPS.

[5] Nir Friedman,et al. The Bayesian Structural EM Algorithm , 1998, UAI.

[6] Kate Saenko,et al. AN ASYNCHRONOUS DBN FOR AUDIO-VISUAL SPEECH RECOGNITION , 2006, 2006 IEEE Spoken Language Technology Workshop.

[7] Alex Pentland,et al. Coupled hidden Markov models for complex action recognition , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[8] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.

[9] Kevin P. Murphy,et al. A coupled HMM for audio-visual speech recognition , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10] Jeff A. Bilmes,et al. DBN based multi-stream models for speech , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[11] P. Ladefoged,et al. Factor analysis of tongue shapes. , 1971, Journal of the Acoustical Society of America.

[12] Jeff A. Bilmes,et al. DBN based multi-stream models for audio-visual speech recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[13] Qiang Ji,et al. Learning discriminant features for multi-view face and eye detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[14] Mark Hasegawa-Johnson,et al. Acoustic segmentation using switching state Kalman filter , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[15] Lv Guo-yun. Research on DBN-based continuous speech recognition and phoneme segment , 2007 .

[16] D. Rubin,et al. Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[17] Yang Wang,et al. Robust facial feature tracking under varying face pose and facial expression , 2007, Pattern Recognit..