Automatic object model acquisition and object recognition by integrating linguistic and visual information

Making effective use of multimedia content depends on structural analysis of that content, which requires integrating several media processing techniques, including image, audio, and text analysis. To understand the utterances in a video in the context of the scene, it is essential to recognize which objects appear in the video. In this paper, we focus on Japanese cooking TV videos and propose a method that acquires object models of foods in an unsupervised manner and performs object recognition based on the acquired models. First, the topic of each video segment is identified using HMMs to obtain good examples for object model acquisition. Next, close-up images are extracted from the image sequences, and an attention region is determined on each close-up image. Then, an important word is extracted as a keyword from the utterances around the close-up image and is associated with that image. Object models are acquired by collecting pairs of close-up images and keywords from a large number of videos. Once the object models have been acquired, object recognition is performed based on these models and linguistic information. We conducted experiments on two kinds of cooking TV programs. We acquired object models of around 100 foods with an accuracy of 77.8%. The F-measure of object recognition was 0.727.
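The acquisition and recognition steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the image features or the matching rule, so the sketch assumes each attention region is already summarized as a fixed-length feature vector, builds each object model as the mean vector per keyword, and recognizes by nearest model. The function names and the `candidates` parameter (standing in for keywords suggested by nearby utterances) are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def acquire_object_models(pairs):
    """Build one model per keyword by averaging its feature vectors.

    `pairs` is an iterable of (keyword, feature_vector) tuples, one per
    close-up image. Averaging is an assumption made for illustration.
    """
    grouped = defaultdict(list)
    for keyword, feature in pairs:
        grouped[keyword].append(feature)

    models = {}
    for keyword, features in grouped.items():
        dim = len(features[0])
        models[keyword] = [
            sum(f[i] for f in features) / len(features) for i in range(dim)
        ]
    return models


def recognize(feature, models, candidates=None):
    """Return the keyword whose model is nearest to `feature`.

    `candidates` is a hypothetical stand-in for linguistic information:
    if given, only those keywords are considered.
    """
    names = [k for k in models if candidates is None or k in candidates]
    if not names:
        return None

    def distance(keyword):
        # Euclidean distance between the feature and the keyword's model.
        return sqrt(sum((a - b) ** 2 for a, b in zip(feature, models[keyword])))

    return min(names, key=distance)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, calling `acquire_object_models` on a few (keyword, vector) pairs and then `recognize` on a new vector returns the keyword with the closest mean vector; passing `candidates` restricts the search to foods mentioned in the surrounding utterances.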
