A self-calibrating algorithm for speaker tracking based on audio-visual statistical models

We present a self-calibrating algorithm for audio-visual tracking using two microphones and a camera. The algorithm uses a parametrized statistical model that combines simple models of video and audio. Using unobserved variables, the model describes the process that generates the observed data. It is therefore able to capture and exploit the statistical structure of the audio and video data, as well as their mutual dependencies. The model parameters are estimated by the EM algorithm; object templates are learned and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location using the model. Successful performance is demonstrated on real multimedia clips.
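
The Bayesian inference step can be illustrated with a minimal sketch. It is not the paper's model: it assumes a discrete set of horizontal locations, a flat prior, a per-location video log-likelihood that is already given, and an inter-microphone time delay that is Gaussian around a linear function of location. The names `alpha`, `beta`, and `delay_std` are hypothetical calibration parameters that, in the setting described above, would be estimated by EM rather than set by hand.

```python
import math


def gaussian_log_pdf(value, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * ((value - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def location_posterior(video_log_lik, observed_delay, alpha, beta, delay_std):
    """Posterior over discrete locations x = 0..N-1 under a flat prior.

    Assumption (illustrative only): the audio time delay is modeled as
    Gaussian with mean alpha * x + beta and standard deviation delay_std.
    """
    log_post = [
        v + gaussian_log_pdf(observed_delay, alpha * x + beta, delay_std)
        for x, v in enumerate(video_log_lik)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Toy data: video alone is ambiguous between locations 1, 2, and 4;
# the audio delay favors a location near x = 2.
video_log_lik = [-4.0, -1.0, -0.9, -4.0, -1.1]
posterior = location_posterior(
    video_log_lik, observed_delay=2.1, alpha=1.0, beta=0.0, delay_std=0.5
)
best_x = max(range(len(posterior)), key=posterior.__getitem__)
```

In this toy case the video evidence alone barely separates three candidate locations, while the audio term concentrates the posterior on one of them, which is the kind of mutual dependency the combined model is meant to exploit.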
