Automatic annotation of unique locations from video and text

Given a video and associated text, we propose an automatic annotation scheme that uses a latent topic model to generate topic distributions from weighted text and then modifies these distributions based on visual similarity. We apply this scheme to location annotation of a television series for which transcripts are available. The topic distributions allow us to avoid explicit classification, which is useful when the exact number of locations is unknown. Moreover, many locations are unique to a single episode, making it impossible to obtain representative training data for a supervised approach. Our method first segments each episode into scenes by fusing cues from both images and text. We then assign location-oriented weights to the text and generate a topic distribution for each scene using Latent Dirichlet Allocation. Finally, we update each scene's topic distribution using the distributions of visually similar scenes, where visual similarity between scenes is formulated as an Earth Mover’s Distance problem. We quantitatively validate our multi-modal approach to segmentation and qualitatively evaluate the resulting location annotations. Our results demonstrate that we are able to generate accurate annotations, even for locations seen in only a single episode.
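
The final step, updating a scene's topic distribution from visually similar scenes, can be illustrated with a minimal sketch. The abstract does not state the update rule, so everything specific here is an assumption made for illustration: the function name `smooth_topics`, the exponential similarity kernel, the `alpha` mixing weight, and the `bandwidth` parameter are all hypothetical. The pairwise visual distances (for example, Earth Mover's Distance values) and the per-scene topic distributions from the topic model are assumed to be computed elsewhere.

```python
import math


def smooth_topics(theta, dist, alpha=0.5, bandwidth=1.0):
    """Blend each scene's topic distribution with those of visually similar scenes.

    theta:     list of per-scene topic distributions (each sums to 1).
    dist:      symmetric matrix of visual distances between scenes,
               e.g. precomputed Earth Mover's Distance values.
    alpha:     weight kept on the scene's own text-derived distribution
               (hypothetical parameter, not taken from the paper).
    bandwidth: scale of the exp(-d / bandwidth) similarity kernel
               (hypothetical parameter, not taken from the paper).
    """
    n = len(theta)
    k = len(theta[0])
    updated = []
    for i in range(n):
        # Similarity weight for every other scene; a scene does not vote for itself.
        weights = [
            math.exp(-dist[i][j] / bandwidth) if j != i else 0.0
            for j in range(n)
        ]
        total = sum(weights)
        if total == 0.0:
            # No neighbours available: keep the text-only distribution.
            updated.append(list(theta[i]))
            continue
        # Similarity-weighted average of the neighbours' topic distributions.
        neighbour = [
            sum(weights[j] * theta[j][t] for j in range(n)) / total
            for t in range(k)
        ]
        # Convex combination, so the result is still a probability distribution.
        updated.append(
            [alpha * theta[i][t] + (1.0 - alpha) * neighbour[t] for t in range(k)]
        )
    return updated


# Toy example: scenes 0 and 1 look alike, scene 2 looks different.
theta = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
dist = [[0.0, 0.1, 5.0], [0.1, 0.0, 5.0], [5.0, 5.0, 0.0]]
for row in smooth_topics(theta, dist):
    print([round(p, 3) for p in row])
```

In this toy input, scene 1 has an ambiguous text-derived distribution, and the update pulls it toward the topic that dominates its close visual neighbour, scene 0. Because each update is a convex combination of probability vectors, no renormalisation is needed.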
