Fast unsupervised alignment of video and text for indexing/names and faces

We propose a novel way of combining weakly associated video/audio and text steams in an unsupervised manner which is faster than conventional speech recognition. The technique aligns audio/video and text streams which will enable video search using the associated text. Multimedia of this form includes news broadcast with summaries, parliament proceedings and court trials with transcripts, sports telecast with text commentary, etc. We also show how we can annotate the video with the names of the person appearing in the video which will allow name based indexing/search. We test the technique on a 80 minute video segment downloaded from the website of the International Court of the Former Yugoslavia, with the corresponding transcripts. The proposed technique achieves 88.49% accuracy on sentence level alignments and 95.5% accuracy on the task of assigning names to faces.

[1]  Yee Whye Teh,et al.  Names and faces in the news , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[2]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[3]  Andrew Zisserman,et al.  Automatic face recognition for film character retrieval in feature-length films , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[4]  Tomás Lozano-Pérez,et al.  A Framework for Multiple-Instance Learning , 1997, NIPS.

[5]  Andrew Zisserman,et al.  Hello! My name is... Buffy'' -- Automatic Naming of Characters in TV Video , 2006, BMVC.

[6]  Kobus Barnard,et al.  Word Sense Disambiguation with Pictures , 2003, Artif. Intell..

[7]  Qi Zhang,et al.  EM-DD: An Improved Multiple-Instance Learning Technique , 2001, NIPS.

[8]  Ming-Hsuan Yang,et al.  Kernel Eigenfaces vs. Kernel Fisherfaces: Face recognition using kernel methods , 2002, Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition.

[9]  Andreas Stolcke,et al.  An efficient repair procedure for quick transcriptions , 2004, INTERSPEECH.

[10]  Cordelia Schmid,et al.  Face detection in a video sequence - a temporal approach , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[11]  Oded Maron,et al.  Multiple-Instance Learning for Natural Scene Classification , 1998, ICML.

[12]  Jitendra Malik,et al.  Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Paul A. Viola,et al.  Robust Real-time Object Detection , 2001 .

[14]  Michael Isard,et al.  CONDENSATION—Conditional Density Propagation for Visual Tracking , 1998, International Journal of Computer Vision.