Translating Images to Keywords: Problems, Applications and Progress

Advances in technology generate huge amounts of non-textual information, such as images, so an efficient image annotation and retrieval system is highly desirable. Clustering algorithms make it possible to represent the visual features of images with a finite set of symbols. Building on this, many statistical models have been published that analyze the correspondence between visual features and words and discover hidden semantics. These models improve the annotation and retrieval of large image databases. However, image data are usually high-dimensional. Traditional clustering algorithms assign equal weights to all dimensions and become confounded when dealing with so many of them. In this tutorial, we will first present the current state of the art and its shortcomings, including some classical models (e.g., the translation model (TM) and the cross-media relevance model). Second, we will present a weighted feature selection algorithm as a solution to this problem: for a given cluster, we determine the relevant features based on histogram analysis and assign greater weight to them than to less relevant features. Third, we will exploit spatial correlation to disambiguate visual features, constructing spatial relationships by spatial association rule mining. Fourth, we will present the continuous relevance model and the multiple Bernoulli model, which avoid clustering, together with mechanisms to link visual tokens with keywords based on these models. Fifth, we will present mechanisms to improve the accuracy of the classical TM by exploiting the WordNet knowledge base. Sixth, we will present a framework for modeling semantic visual concepts in video and images by fusing multiple sources of evidence with the help of an ontology. Seventh, we will show that weighted feature selection performs better than the traditional approach (TM) for automatic image annotation and retrieval.
Finally, we will discuss open problems and future directions in the image and video domain.
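
To make the idea of linking visual tokens with keywords concrete, the following is a minimal sketch of a translation-style model, assuming each image has already been reduced to a list of cluster IDs (the "finite symbols" produced by clustering). It estimates P(word | token) with an IBM-Model-1-style EM loop, which is one common way to realize a translation model; it is an illustration under these assumptions, not necessarily the exact formulation covered in the tutorial. The function names, the toy data, and the parameter values are all hypothetical.

```python
from collections import defaultdict


def train_translation_table(pairs, iterations=20):
    """Estimate t[(word, token)] ~ P(word | token) by EM.

    pairs: list of (tokens, words) tuples, one per annotated image, where
    tokens are cluster IDs of image regions (an assumption of this sketch).
    """
    words_all = sorted({w for _, ws in pairs for w in ws})
    tokens_all = sorted({b for bs, _ in pairs for b in bs})
    # Start from a uniform distribution over words for every token.
    t = {(w, b): 1.0 / len(words_all) for w in words_all for b in tokens_all}

    for _ in range(iterations):
        count = defaultdict(float)  # expected (word, token) alignments
        total = defaultdict(float)  # expected alignments per token
        for tokens, words in pairs:
            if not tokens:
                continue  # nothing to align against
            for w in words:
                # E-step: share each word among the tokens of its image.
                norm = sum(t[(w, b)] for b in tokens)
                for b in tokens:
                    frac = t[(w, b)] / norm
                    count[(w, b)] += frac
                    total[b] += frac
        # M-step: renormalize the expected counts per token.
        t = {
            (w, b): (count[(w, b)] / total[b] if total[b] > 0 else 0.0)
            for w in words_all
            for b in tokens_all
        }
    return t


def annotate(tokens, table, vocabulary, top_k=2):
    """Rank keywords for an unlabeled image by summing P(word | token)."""
    scores = {w: sum(table.get((w, b), 0.0) for b in tokens) for w in vocabulary}
    return sorted(vocabulary, key=lambda w: -scores[w])[:top_k]


# Toy corpus (hypothetical): integer tokens stand in for region clusters.
toy_pairs = [
    ([0, 1], ["sky", "grass"]),
    ([0, 2], ["sky", "water"]),
    ([1, 3], ["grass", "tiger"]),
]
toy_vocabulary = sorted({w for _, ws in toy_pairs for w in ws})
toy_table = train_translation_table(toy_pairs)
print(annotate([1, 3], toy_table, toy_vocabulary))
```

On this toy corpus, the co-occurrence pattern is enough for EM to associate each token with a single dominant keyword. With real data, the quality of the table depends heavily on how well the clustering step separates visual features, which is the motivation for the weighted feature selection discussed in the abstract.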