A Multimodal Image Database System

We demonstrate PBIR , an integrated system that we have built for conducting multimodal image retrieval. The system combines the strengths of content-based soft annotation (CBSA), multimodal relevance feedback through active learning, and perceptual distance formulation and indexing. PBIR supports multimodal query and annotation in any combination of its three basic modes: seed-by-nothing, seedby-keywords, and seed-by-content. We demonstrate PBIR on a couple of very large image sets provided by image vendors and crawled from the Internet. 1 Overview For a search engine to perform effective searches, it has to comprehend what a user wants. Our demonstration presents a multimodal perception-based image retrieval system (PBIR ), which can capture users’ subjective queryconcepts thoroughly, and hence achieve high search accuracy. PBIR improves over PBIR (our previous demonstration [2]) by the following technologies that we have recently developed: Content-based soft annotation, Multimodal (textual and perceptual) relevance feedback through active learning, and Perceptual distance formulation and indexing. 2 Content-based Soft Annotation Content-based image retrieval supports image searches based on perceptual features, such as color, texture, and shape. However, for most users, articulating a content-based query using these low-level features can be non-intuitive and difficult. Many users prefer to using keywords to conduct searches. We believe that a keywordand content-based combined approach can benefit from the strengths of these two paradigms. A user can start a query by entering a few keywords. Once some images relevant to the query are found, the image system can use these images’ perceptual features, together with their annotation, to perform multimodal query refinement. Images must be annotated to support such keywordand content-based combined queries and refinement. In [3] we propose a content-based soft annotation (CBSA) approach to provide images each with multiple semantical labels. The input to CBSA is a training image set, each image in the set is manually annotated with one single semantical label. CBSA propagates these labels to unlabeled images as well as the labeled ones. At the end of the annotation process, each image is annotated with a label-vector, and each label in the vector a confidence factor. For instance, each image in a training set is initially labeled with one of K labels such as forest, tiger, sky, etc. Each image at the end of the CBSA process is annotated with a word vector of K labels. An image label-vector (forest : 0:1; tiger : 0:9; sky : 0:7; ) means that the image is believed to contain semantics of forest, tiger, and sky with 10%, 90%, and 70% confidence, respectively. When a text-based search is issued with keywords, images are ranked and retrieved based on their combined confidence factors on the matching labels. The content-based soft annotation algorithm consists of the following three steps: 1. Manually labeling a set of training images each with one of the pre-selected K semantical labels. 2. Training K classifiers. Based on the labeled instances, we train an ensemble of K BPM (Bayes Point Machine) binary classifiers. Each classifier is responsible for determining the confidence factor for a semantical label. 3. Annotating images using the classifiers. Each image is classified by the K classifiers. Each classifier gives each image a confidence factor on the label that the classifier is responsible for predicting. 
3 Multimodal Active Learning

As pointed out by [1], automatic annotation may not attain extremely high accuracy at the present state of computer vision and image processing. However, providing images with some reliable semantic labels and then refining the unconfirmed labels via relevance feedback is believed to be an effective approach [14]. CBSA initializes images with a set of semantic words significantly better than chance. Our empirical study shows that even though the initial annotation may not be perfect, CBSA helps a user quickly find some relevant images via a keyword search. Once some relevant images have been found, query-refinement methods such as MEGA [8] and SVMActive [12] can be employed to quickly zoom in on the user's query concept.

In [11], we show that the annotation quality can be improved by using user feedback collected from active learning sessions. When a user types in a keyword, say W (W can also be collected from annotated images that the user marks as "relevant" to his or her target concept), we select the images whose membership with respect to W is most difficult for the active learning algorithm to determine, and use those to solicit user feedback.
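The selection step just described can be pictured with a minimal sketch. It assumes the per-image confidences for keyword W come from the CBSA annotation (or from the current active-learning classifier) and returns the images whose membership in W is most uncertain, i.e., closest to the decision boundary, in the spirit of SVMActive; the function name, arguments, and batch size are hypothetical.

```python
# Minimal sketch of active-learning sample selection for a keyword W.
# Assumption: conf_w holds each image's current confidence of containing W,
# e.g. the column of the CBSA confidence matrix that corresponds to W.
import numpy as np

def select_most_uncertain(conf_w, pool_indices, batch_size=10):
    """Pick the pool images whose membership in W is hardest to decide."""
    uncertainty = np.abs(np.asarray(conf_w)[pool_indices] - 0.5)  # 0 = most uncertain
    order = np.argsort(uncertainty)
    return [pool_indices[i] for i in order[:batch_size]]
```

The user's relevant/irrelevant feedback on the selected images then serves two purposes: it refines the query concept and it confirms or corrects the soft annotation for W.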

[1] Edward Y. Chang, et al. Effective image annotation via active learning, 2002, Proceedings of the IEEE International Conference on Multimedia and Expo.

[2] Piotr Indyk, et al. Similarity Search in High Dimensions via Hashing, 1999, VLDB.

[3] David A. Forsyth, et al. Learning the semantics of words and pictures, 2001, Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001).

[4] Mary Czerwinski, et al. Semi-Automatic Image Annotation, 2001, INTERACT.

[5] D. Gentner, et al. Respects for similarity, 1993.

[6] Edward Y. Chang, et al. DynDex: a dynamic and non-metric space indexer, 2002, MULTIMEDIA '02.

[7] D. Medin, et al. The role of theories in conceptual coherence, 1985, Psychological Review.

[8] Edward Y. Chang, et al. CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines, 2003, IEEE Trans. Circuits Syst. Video Technol.

[9] A. Tversky. Features of Similarity, 1977.

[10] Edward Y. Chang, et al. Mining image features for efficient query processing, 2001, Proceedings of the 2001 IEEE International Conference on Data Mining.

[11] Edward Y. Chang, et al. PBIR - perception-based image retrieval, 2001, SIGMOD '01.

[12] Edward Y. Chang, et al. Support vector machine active learning for image retrieval, 2001, MULTIMEDIA '01.

[13] Edward Y. Chang, et al. Discovery of a perceptual distance function for measuring image similarity, 2003, Multimedia Systems.