Paper information - Improving speech recognition with audio-visual tandem classifiers and their fusions

The "tandem approach" is a method used in speech recognition to improve performance by using classifier posterior probabilities as observations in a hidden Markov model. In this work we study the effect of using multiple visual tandem features on audio-visual recognition accuracy. In addition, we investigate methods for combining the outputs of several audio and visual tandem classifiers through a classifier fusion system that generates its outputs using learned weights. Experiments show that both approaches improve recognition accuracy over regular audio-visual speech recognition, especially in noisy environments.
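The two ideas in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes per-frame posteriors are already available from one audio and one visual classifier (random placeholders here), fuses them with a weighted linear combination, and turns the result into tandem features by taking logs and decorrelating with PCA. The function names, the fixed weights `[0.7, 0.3]`, and the PCA step are assumptions made for illustration; in the paper the fusion weights are learned, and the resulting features would be passed to an HMM as observations.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, used here only to create placeholder posteriors."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fuse_posteriors(posterior_list, weights):
    """Weighted linear combination of classifier posteriors, renormalized per frame.

    posterior_list: list of K arrays, each of shape (frames, classes).
    weights: K non-negative weights (hypothetical fixed values in this sketch).
    """
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(posterior_list, axis=0)       # (K, frames, classes)
    fused = np.tensordot(weights, stacked, axes=1)   # (frames, classes)
    return fused / fused.sum(axis=1, keepdims=True)

def tandem_features(posteriors, n_components, eps=1e-10):
    """Log-compress posteriors, then decorrelate them with PCA (assumed step)."""
    logp = np.log(posteriors + eps)
    centered = logp - logp.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered log-posteriors.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Placeholder data: 50 frames, 5 classes per classifier.
rng = np.random.default_rng(0)
n_frames, n_classes = 50, 5
audio_post = softmax(rng.normal(size=(n_frames, n_classes)))
visual_post = softmax(rng.normal(size=(n_frames, n_classes)))

fused = fuse_posteriors([audio_post, visual_post], weights=[0.7, 0.3])
features = tandem_features(fused, n_components=3)
```

Here `features` has one row per frame and would play the role of the HMM observation vectors. Fusing at the posterior level, before the log and PCA steps, is only one possible arrangement; concatenating separate audio and visual tandem features is another.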
