Automatic Video Annotation by Mining Speech Transcripts

We describe a model for automatically predicting text annotations for video data. The speech transcripts of the videos are clustered using an aspect model, and keywords are extracted based on the aspect distributions, thereby capturing the semantic information available in the video data. This technique for automatic keyword-vocabulary construction makes labelling video data straightforward. We then build a video shot vocabulary by utilizing both static images and motion cues. Using a maximum entropy criterion, we learn a conditional exponential model by defining constraint features over combinations of the shot and keyword vocabularies. Our method predicts annotations using a maximum a posteriori estimate of the exponential model. We evaluate the model's ability to predict annotations in terms of mean negative log-likelihood and retrieval performance on the test set. A comparison of the exponential model with baseline methods yields encouraging results.
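The aspect-model clustering step can be illustrated with a minimal PLSA-style EM sketch. This is an assumption-laden illustration, not the paper's implementation: it assumes transcripts are already reduced to a document-term count matrix, and the function and variable names (`plsa`, `top_keywords`) are hypothetical.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit a PLSA aspect model to a document-term count matrix via EM.

    counts: (n_docs, n_words) array of term counts from speech transcripts.
    Returns P(z|d) of shape (n_docs, n_topics) and P(w|z) of shape
    (n_topics, n_words). Illustrative sketch, not the paper's code.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z | d, w) for every (document, word) pair
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (d, z, w)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        expected = counts[:, None, :] * post               # (d, z, w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

def top_keywords(p_w_z, vocab, k=3):
    """Extract the k highest-probability words per aspect as keywords."""
    return [[vocab[i] for i in np.argsort(-row)[:k]] for row in p_w_z]
```

The keyword vocabulary would then be the union of the per-aspect keyword lists; the actual aspect model and keyword-selection criterion are those described in the paper.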
