Using Visual-Textual Mutual Information and Entropy for Inter-modal Document Indexing

This paper presents a statistical indexing model for automatic visual document indexing based on inter-modal analysis. Inter-modal analysis consists of modeling and learning relationships between several modalities from a set of annotated documents in order to extract semantics. When one of the modalities is textual, the learned associations can be used to predict a textual index for the visual data of a new document (image or video). More specifically, the presented approach relies on a learning process in which associations between visual and textual information are characterized by the mutual information between the two modalities. In addition, the model uses the entropy of the distribution of the visual modality with respect to the textual modality as a second source of evidence for selecting relevant indexing terms. We have implemented the proposed information-theoretic model, and experiments assessing its performance on two collections (one of images, one of videos) show that information theory is an interesting framework for automatically annotating documents.
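As an illustrative sketch only (the notation below is an assumption, not the paper's own formulation), the two information-theoretic quantities involved can be written for a discrete visual variable V (e.g. quantized visual features) and a textual indexing term variable T, with probabilities estimated empirically from the annotated training collection:

```latex
% Illustrative notation only, not reproduced from the paper.
% V: discrete visual variable (e.g. quantized visual features)
% T: textual indexing term variable
% P(.): empirical probabilities estimated on the annotated training set

% Mutual information between the visual and textual modalities,
% characterizing the strength of their association:
I(V;T) = \sum_{v}\sum_{t} P(v,t)\,\log\frac{P(v,t)}{P(v)\,P(t)}

% Entropy of the distribution of the visual modality for a given term t,
% one possible reading of the "second source" used to select indexing terms:
H(V \mid T = t) = -\sum_{v} P(v \mid t)\,\log P(v \mid t)
```

Under such a reading, terms whose association with the visual content carries high mutual information, while the corresponding visual distribution has low entropy, would be preferred as indexing terms; the paper's exact selection criterion may differ.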