Improving speech recognition with audio-visual tandem classifiers and their fusions

The tandem approach is a method used in speech recognition that improves performance by using classifier posterior probabilities as observations in a hidden Markov model. In this work, we study the effect of using multiple visual tandem features to improve audio-visual recognition accuracy. In addition, we investigate methods for combining the outputs of several audio and visual tandem classifiers through a classifier fusion system that generates its outputs using learned weights. Experiments show that both approaches improve accuracy over conventional audio-visual speech recognition, especially in noisy environments.
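As a rough illustration of the two ideas named in the abstract, the following Python sketch turns frame-level classifier posteriors into tandem-style observation vectors and linearly combines the posteriors of an audio and a visual classifier. It is a minimal sketch under stated assumptions, not the system evaluated in this work: the function names `tandem_features` and `fuse_posteriors`, the toy posterior values, and the fixed fusion weights are all hypothetical. In particular, the weights here are supplied by hand, whereas the abstract describes weights that are learned, and the decorrelation step commonly applied to log posteriors in tandem systems is replaced by simple mean removal.

```python
import numpy as np


def tandem_features(posteriors, eps=1e-10):
    """Turn frame-level class posteriors (T x C) into tandem-style features.

    Assumption: log compression followed by per-dimension mean removal.
    Tandem systems often add a decorrelating transform, omitted here.
    """
    log_post = np.log(np.clip(posteriors, eps, 1.0))
    return log_post - log_post.mean(axis=0, keepdims=True)


def fuse_posteriors(posterior_list, weights):
    """Weighted linear combination of K classifiers' posteriors (each T x C).

    Assumption: one non-negative weight per classifier, given by the caller.
    A learned combiner would estimate these weights from training data.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                # normalize to sum to 1
    stacked = np.stack(posterior_list, axis=0)       # shape (K, T, C)
    fused = np.tensordot(weights, stacked, axes=1)   # shape (T, C)
    return fused / fused.sum(axis=1, keepdims=True)  # renormalize each frame


# Toy posteriors for 2 frames and 3 classes (illustrative values only).
audio_post = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])
visual_post = np.array([[0.5, 0.3, 0.2],
                        [0.2, 0.5, 0.3]])

fused_post = fuse_posteriors([audio_post, visual_post], weights=[0.75, 0.25])
observations = tandem_features(fused_post)  # would be passed to an HMM back end
```

In a noisy audio condition, one would expect a learned combiner to shift weight toward the visual classifier; the fixed 0.75/0.25 split above is only a placeholder.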
