EM detection of common origin of multi-modal cues

Content analysis of clips containing people speaking involves processing informative cues coming from different modalities. These cues are typically the words extracted from the audio modality and the identities of the persons appearing in the video modality of the clip. To assign these cues efficiently to the person who produced them, we propose a Bayesian network model that exploits the extracted feature characteristics, their relations, and their temporal patterns. We use the EM algorithm, in which the E-step estimates the expectation of the complete-data log-likelihood with respect to the hidden variables, namely the identities of the speakers and the visible persons. In the M-step, the person models that maximize this expectation are computed. This framework produces excellent results and exhibits exceptional robustness when dealing with low-quality data.
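To illustrate the E-step/M-step alternation described above, the sketch below soft-assigns cue feature vectors to persons under the simplifying assumption that each person model is an isotropic Gaussian over a shared audio-visual feature space; the Bayesian-network structure and temporal patterns of the actual framework are not reproduced. The function name `em_assign_cues` and all parameters are illustrative, not taken from the paper.

```python
import numpy as np

def em_assign_cues(features, n_persons, n_iters=50, seed=0):
    """Minimal EM sketch: soft-assign multi-modal cue feature vectors to persons.

    features  : (T, D) array, one feature vector per extracted cue (audio or video)
    n_persons : assumed number of persons appearing/speaking in the clip
    Returns per-cue responsibilities (T, n_persons) and the person model means.
    """
    rng = np.random.default_rng(seed)
    T, D = features.shape

    # Initialise person models (isotropic Gaussians) and identity priors.
    means = features[rng.choice(T, n_persons, replace=False)]
    var = np.full(n_persons, features.var() + 1e-6)
    priors = np.full(n_persons, 1.0 / n_persons)

    for _ in range(n_iters):
        # E-step: posterior over the hidden person identity for every cue.
        sq_dist = ((features[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        log_p = (np.log(priors) - 0.5 * D * np.log(2 * np.pi * var)
                 - 0.5 * sq_dist / var)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate person models from the soft assignments.
        weights = resp.sum(axis=0)                      # effective counts per person
        means = (resp.T @ features) / weights[:, None]
        sq_dist = ((features[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        var = (resp * sq_dist).sum(axis=0) / (D * weights) + 1e-6
        priors = weights / T

    return resp, means
```

After convergence, `resp.argmax(axis=1)` gives a hard assignment of each cue to its most probable originating person, while `resp` itself retains the soft posterior that the E-step computes.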
