Dynamic Dependency Tests for Audio-Visual Speaker Association

We formulate the problem of audio-visual speaker association as a dynamic dependency test: given an audio stream and multiple video streams, we wish to determine their dependency structure as it evolves over time. To this end, we propose a hidden factorization Markov model in which the hidden state encodes a finite number of possible dependency structures. Each dependency structure has an explicit semantic meaning, namely "who is speaking". The model exploits both the structural and the parametric changes associated with a change of speaker, in contrast to standard sliding-window dependence analysis. Using this model, we obtain state-of-the-art performance on an audio-visual association task without the benefit of training data.
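To make the idea concrete, the sketch below implements a toy version of this kind of inference: each hidden state of a small HMM indexes one dependency structure ("the audio currently depends on video stream k"), and the forward algorithm filters a posterior over "who is speaking". This is a minimal illustration under stated assumptions, not the paper's implementation; in particular, the Gaussian agreement score stands in for a proper dependence measure, and the feature shapes and transition probabilities are illustrative.

import numpy as np

# Toy hidden factorization Markov model: hidden state k means "speaker k
# is talking", i.e. the audio depends on video stream k. The emission
# score uses a simple Gaussian agreement term as a stand-in for a
# dependence measure (an illustrative assumption, not the paper's choice).

def emission_loglik(audio, videos, sigma=1.0):
    # audio:  (T, d) audio features; videos: (K, T, d) per-stream features.
    # Returns (T, K): log-likelihood of frame t under dependency structure k.
    diff = videos - audio[None, :, :]                      # (K, T, d)
    return (-0.5 * np.sum(diff ** 2, axis=2) / sigma ** 2).T

def forward_posterior(log_b, stay=0.99):
    # Forward (filtering) recursion over dependency structures.
    # log_b: (T, K) emission log-likelihoods; `stay` is an assumed
    # self-transition probability (speaker changes are rare).
    T, K = log_b.shape
    A = np.full((K, K), (1.0 - stay) / (K - 1))
    np.fill_diagonal(A, stay)
    log_alpha = np.empty((T, K))
    log_alpha[0] = log_b[0] - np.log(K)                    # uniform prior
    for t in range(1, T):
        m = log_alpha[t - 1].max()                         # log-sum-exp trick
        log_alpha[t] = log_b[t] + m + np.log(np.exp(log_alpha[t - 1] - m) @ A)
    post = np.exp(log_alpha - log_alpha.max(axis=1, keepdims=True))
    return post / post.sum(axis=1, keepdims=True)          # (T, K) posterior

# Usage on synthetic data: speaker 0 talks for 100 frames, then speaker 1.
rng = np.random.default_rng(0)
T, K, d = 200, 2, 4
videos = rng.normal(size=(K, T, d))
truth = np.repeat([0, 1], T // 2)
audio = videos[truth, np.arange(T)] + 0.3 * rng.normal(size=(T, d))
post = forward_posterior(emission_loglik(audio, videos))
print((post.argmax(axis=1) == truth).mean())               # close to 1.0

Because speaker identity is encoded directly in the hidden state, a single filtering pass yields the time-varying dependency structure, rather than repeated independent tests over sliding windows.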
