Boosted learning in dynamic Bayesian networks for multimodal speaker detection

Bayesian network models provide an attractive framework for multimodal sensor fusion. They combine an intuitive graphical representation with efficient algorithms for inference and learning. However, the unsupervised nature of standard parameter learning algorithms for Bayesian networks can lead to poor performance in classification tasks. We have developed a supervised learning framework for Bayesian networks, which is based on the Adaboost algorithm of Schapire and Freund. Our framework covers static and dynamic Bayesian networks with both discrete and continuous states. We have tested our framework in the context of a novel multimodal HCI application: a speech-based command and control interface for a Smart Kiosk. We provide experimental evidence for the utility of our boosted learning approach.

[1]  D. Opitz,et al.  Popular Ensemble Methods: An Empirical Study , 1999, J. Artif. Intell. Res..

[2]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[3]  Ashutosh B Tech Learning Dynamic Bayesian Networks for Multimodal Speaker Detection I Thank My Colleagues at Image Formation and Processing Group for Providing a Great Table of Contents , 2000 .

[4]  Alex Pentland,et al.  LAFTER: lips and face real time tracker , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Jaime G. Carbonell,et al.  Machine learning research , 1981, SGAR.

[6]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[7]  Larry S. Davis,et al.  Look who's talking: speaker detection using video and audio correlation , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[8]  Biing-Hwang Juang,et al.  Discriminative learning for minimum error classification [pattern recognition] , 1992, IEEE Trans. Signal Process..

[9]  Holger Schwenk,et al.  Using boosting to improve a hybrid HMM/neural network speech recognizer , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[10]  James M. Rehg,et al.  Statistical Color Models with Application to Skin Detection , 2004, International Journal of Computer Vision.

[11]  Avi Pfeffer,et al.  Object-Oriented Bayesian Networks , 1997, UAI.

[12]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems , 1988 .

[13]  Andrew D. Christian,et al.  Digital smart kiosk project , 1998, CHI.

[14]  Bhiksha Raj,et al.  A boosting approach for confidence scoring , 2001, INTERSPEECH.

[15]  Nir Friedman,et al.  Bayesian Network Classifiers , 1997, Machine Learning.

[16]  Robert E. Schapire,et al.  A Brief Introduction to Boosting , 1999, IJCAI.

[17]  Vladimir Pavlovic,et al.  Audio-visual speaker detection using dynamic Bayesian networks , 2000, Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580).

[18]  J. Friedman Special Invited Paper-Additive logistic regression: A statistical view of boosting , 2000 .

[19]  Aaron E. Rosenberg,et al.  Speaker identification using minimum classification error training , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[20]  Trevor Darrell,et al.  Learning Joint Statistical Models for Audio-Visual Fusion and Segregation , 2000, NIPS.

[21]  Alex Pentland,et al.  On Reversing Jensen's Inequality , 2000, NIPS.

[22]  Lalit R. Bahl,et al.  Estimating hidden Markov model parameters so as to maximize speech recognition accuracy , 1993, IEEE Trans. Speech Audio Process..

[23]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[24]  Patrick Pérez,et al.  Sequential Monte Carlo Fusion of Sound and Vision for Speaker Tracking , 2001, ICCV.

[25]  Takeo Kanade,et al.  Neural Network-Based Face Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[26]  James M. Rehg,et al.  Vision-based speaker detection using Bayesian networks , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[27]  Alexander H. Waibel,et al.  A real-time face tracker , 1996, Proceedings Third IEEE Workshop on Applications of Computer Vision. WACV'96.

[28]  Vladimir Pavlovic,et al.  Multimodal speaker detection using error feedback dynamic Bayesian networks , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[29]  Nebojsa Jojic,et al.  A self-calibrating algorithm for speaker tracking based on audio-visual statistical models , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[30]  Michael I. Jordan,et al.  Hierarchical Mixtures of Experts and the EM Algorithm , 1994, Neural Computation.

[31]  Yoav Freund,et al.  Boosting the margin: A new explanation for the effectiveness of voting methods , 1997, ICML.

[32]  Chuan Long,et al.  Boosting Noisy Data , 2001, ICML.

[33]  James M. Rehg,et al.  Vision for a smart kiosk , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.