Learning semantic multimedia representations from a small set of examples

We approach semantic multimedia retrieval as a supervised learning problem. By defining a lexicon of a small number of interesting semantic concepts, we can handle a range of semantic queries. Since the number of examples of these concepts available for training is usually small, we explore discriminant learning techniques. In particular, we examine the use of kernel-based methods and demonstrate impressive retrieval performance on semantic concepts such as rocket, outdoor, greenery, sky, and face. We also show that loosely coupled multimodal events can be detected by late fusion of the detections of related auditory and visual concepts. Using a Bayesian network for inference, we show how a rocket-launch event can be detected from the detection of a related visual concept (rocket object) and a related auditory concept (explosion/blast-off).
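The late-fusion step can be illustrated with a minimal sketch. It assumes a three-node Bayesian network in which a binary rocket-launch event is the parent of two binary detector outputs (visual rocket object, auditory explosion/blast-off) that are conditionally independent given the event. The function name, the network structure, and every probability value below are hypothetical choices for illustration; the abstract does not specify them.

```python
def posterior_event(visual_detected, audio_detected,
                    prior=0.05,
                    p_visual=(0.10, 0.85),
                    p_audio=(0.20, 0.80)):
    """Return P(event | detector outputs) by exact enumeration.

    Assumed (hypothetical) parameters:
      prior       P(event = 1)
      p_visual    (P(visual fires | no event), P(visual fires | event))
      p_audio     (P(audio fires | no event),  P(audio fires | event))
    """
    def likelihood(event):
        # Probability of the observed detector outputs given the event state,
        # using conditional independence of the two detectors.
        pv = p_visual[event]
        pa = p_audio[event]
        lv = pv if visual_detected else 1.0 - pv
        la = pa if audio_detected else 1.0 - pa
        return lv * la

    joint_event = prior * likelihood(1)
    joint_no_event = (1.0 - prior) * likelihood(0)
    # Bayes' rule: normalise the joint over both event states.
    return joint_event / (joint_event + joint_no_event)


if __name__ == "__main__":
    for v in (False, True):
        for a in (False, True):
            p = posterior_event(v, a)
            print(f"visual={v!s:5} audio={a!s:5} P(launch)={p:.3f}")
```

Under these assumed parameters, the posterior is highest when both detectors fire and much lower when only one does, which is the behaviour late fusion is meant to capture. A fuller treatment would feed in soft detector confidences rather than binary decisions.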
