Vision-based speaker detection using Bayesian networks

Developing user interfaces based on vision and speech requires solving a challenging statistical inference problem: the intentions and actions of multiple individuals must be inferred from noisy and ambiguous data. We argue that Bayesian network models are an attractive statistical framework for cue fusion in these applications. Bayes nets combine a natural mechanism for expressing contextual information with efficient algorithms for learning and inference. We illustrate these points by developing a Bayes net model for detecting when a user is speaking. The model combines four simple vision sensors: face detection, skin color, skin texture, and mouth motion. We present promising experimental results.
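To make the idea of cue fusion concrete, the following is a minimal sketch, not the network described in the paper. It assumes the simplest possible structure: one binary "speaking" node whose four sensor children are conditionally independent given that node (a naive Bayes simplification). The prior, the conditional probabilities, and all names (`PRIOR_SPEAKING`, `SENSOR_MODEL`, `posterior_speaking`) are hypothetical values chosen only for illustration.

```python
from math import prod

# Hypothetical prior P(speaking); not taken from the paper.
PRIOR_SPEAKING = 0.3

# Hypothetical sensor models, one entry per vision sensor:
# (P(sensor fires | speaking), P(sensor fires | not speaking)).
SENSOR_MODEL = {
    "face_detection": (0.90, 0.40),
    "skin_color": (0.85, 0.50),
    "skin_texture": (0.80, 0.45),
    "mouth_motion": (0.75, 0.10),
}


def likelihood(observations, speaking):
    """P(observations | speaking), assuming the sensors are
    conditionally independent given the speaking state."""
    idx = 0 if speaking else 1
    terms = []
    for name, fired in observations.items():
        p_fire = SENSOR_MODEL[name][idx]
        terms.append(p_fire if fired else 1.0 - p_fire)
    return prod(terms)  # the product of an empty list is 1


def posterior_speaking(observations, prior=PRIOR_SPEAKING):
    """P(speaking | observations) by Bayes' rule.

    Sensors missing from `observations` are simply left out, which
    marginalizes them away under the independence assumption.
    """
    joint_yes = prior * likelihood(observations, True)
    joint_no = (1.0 - prior) * likelihood(observations, False)
    return joint_yes / (joint_yes + joint_no)


if __name__ == "__main__":
    frame = {
        "face_detection": True,
        "skin_color": True,
        "skin_texture": True,
        "mouth_motion": False,
    }
    print(f"P(speaking | frame) = {posterior_speaking(frame):.3f}")
```

A full Bayes net would typically add intermediate and contextual nodes between the sensors and the speaking variable; this flat structure only shows how evidence from several noisy sensors combines into a single posterior.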
