Progress in Content-based Image Retrieval (CBIR) is hampered by the lack of good evaluation practice and test benches. In this paper, we raise awareness of all the parameters that define a content-based indexing and retrieval method. Extensive ground-truth, 15,324 hand-checked image queries, was developed for a portrait database of gray-level images and their backside studio logos. Our aim was to clearly demonstrate the diminishing effect of a growing embedding on performance figures, and to establish a reliable ranking of several suggested CBIR gray-level indexing methods. This evaluation scheme was first used to optimize a number of parameters that define the detailed workings of each method. The database, standard image queries, ground-truth, and evaluation scripts are offered for inclusion in an evaluation site such as Benchathlon.

1. INGREDIENTS FOR A STATISTICALLY MEANINGFUL CBIR TEST SET-UP

In [5], we stated what should be added to recent CBIR system evaluation proposals such as [6] to set up an evaluation procedure that completely describes the influence of the relevant parameters of the indexing and retrieval methods. Our main objection to the current standard practice of using precision-recall graphs and precision-scope graphs to evaluate and compare methods is that the influence of the relative size of the relevant image class versus the size of the irrelevant (embedding) items in a database, characterized as generality, has somehow been lost after the first years of text-retrieval evaluations, as shown by [9] and [8]. Therefore, our first objective with this paper is to show convincingly (statistically speaking) that a growing embedding of irrelevant items around relevant image classes diminishes retrieval performance. This effect, characterized by generality (class size of relevant items / size of test database), should be combined with the well-known performance measures precision (number of relevant items retrieved within the scope / scope size) and recall (number of relevant items retrieved within the scope / class size of relevant items).

A second objection against present presentations of performance figures is that retrieval scopes are not normalized with respect to the size of the relevant image class. The effect this has on the test set-up and on normalization was illustrated in our performance evaluation paper [5].

A major obstacle to the selection of promising methods for inclusion in commercial image retrieval systems is the slow development of image search test benches. Initiatives like Benchathlon (http://www.benchathlon.net) deserve more than a one-time backing. Since all indexing and retrieval methods will perform differently for each different set of user queries carried out within each specific embedding, we have to develop a mass of standard user queries that are used within well-defined embeddings to single out those indexing and retrieval methods that can be expected to perform better in general.

1.1. CBIR method definition: essential parameters?

A CBIR method can be thought of as a successful combination of indexing and ranking techniques. From a narrow-minded perspective this amounts to the extraction of a feature vector (the index) and the sorting of images on the basis of similarity measured with respect to a search image (the retrieval); moreover, a scope is used to limit the display of retrieved items for visual inspection by a user.
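As an illustration of this narrow view and of the measures introduced above, the following minimal sketch is ours and not part of any evaluated system; the feature index layout and the Euclidean distance are placeholder assumptions. It ranks a database by distance to a query feature vector, applies a scope, and reports generality, precision, and recall:

```python
# Minimal sketch of the narrow view of a CBIR method and the three
# performance measures discussed above. The feature index (a dict of
# image id -> feature vector) and the Euclidean distance are placeholders,
# not the indexing methods evaluated in the paper.
import numpy as np

def rank_database(query_vec: np.ndarray, index: dict) -> list:
    """Sort image ids by increasing distance of their feature vector to the query."""
    return sorted(index, key=lambda img_id: np.linalg.norm(index[img_id] - query_vec))

def evaluate(ranking: list, relevant: set, database_size: int, scope: int):
    """Generality, precision, and recall for one query at the given scope."""
    hits = len(set(ranking[:scope]) & relevant)
    generality = len(relevant) / database_size
    precision = hits / scope
    recall = hits / len(relevant)
    return generality, precision, recall

# Usage (hypothetical), with the scope normalized to twice the relevant-class size:
# ranking = rank_database(index[query_id], index)
# g, p, r = evaluate(ranking, relevant_ids, len(index), scope=2 * len(relevant_ids))
```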
In our view, different indexing methods may involve much more than merely taking a different recipe for calculating a feature vector. This includes the precise description of all digitization and preprocessing steps taken before digital images are transformed into feature vectors. These steps should be considered part of the indexing phase of a CBIR method, and all relevant parameters (e.g. threshold settings) applied in sub-processes should be described and quantified.

Many more parameters than a fixed similarity measure and a scope on the ranking list may be considered essential parts of the CBIR retrieval method. Multi-dimensional similarity measures, the difference measure itself (for an overview of optimal one-dimensional metrics see [10]), the effect of weighting feature vector elements differently (for instance due to relevance feedback), and the subsequent clustering methods used to reorganize and reduce the number and ordering of the retrieved items all determine which resulting images are shown to the user. Our evaluation procedure must make all these influences visible, because otherwise differences in performance might be attributed to the wrong causes, and progress toward better CBIR performance in general would become largely erratic.

2. DATABASE: RELEVANT IMAGES EMBEDDED IN IRRELEVANT IMAGES

Any indexing and retrieval method may give reasonable results when its feature space is sparsely filled (due to the high dimensionality of the feature vector and/or due to a small or diverse embedding). Gray-value histograms, for instance, quickly drop in performance within a growing embedding, whereas color histograms that span a higher-dimensional feature space are much more resistant to a growing embedding, especially when the embedding items are very diverse. In the long run, however, even in wide-domain embeddings, like all images on the Internet, color-histogram features will fail to distinguish between too many images.

2.1. A narrow-domain gray-level image database

In our experiments, we have used a test database with only gray-level images, all from a narrow domain. This means a sort of double handicap: we have to rely on intensity-distribution features (shape and/or texture features) to cope with the fact that color/gray-level histogram features cannot distinguish well between large groups of gray-level images.

2.2. Defining a class of standard user queries

One way of forming test queries is to collect user queries and provide them with hand-checked image classes that serve as ground-truth during evaluation. The trouble with such queries, e.g. "find me all images that contain a table as studio prop", is that although it is not hard to decide image by image whether each image is in the table class or not, it takes an enormous amount of time to build ground-truth for a large set of such queries. For our database, 21,000 portraits and 21,000 logos (at the back of the portraits), building up a statistically significant number of test queries with hand-checked ground-truth is not feasible.

Instead, we have implemented ground-truth for two questions that can often be associated with a class of relevant images given different search images taken from the database itself. In these cases, the database can be seen as a multi-class division within an embedding of possibly non-class items; ground-truth can then be hand-checked for all classes while going through the database once. This set-up uses binary class labels: each image can be a member of at most one class.
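As a sketch of how such binary class labels translate into standard queries with ground-truth answers, every member of a class with at least two members becomes a query whose ground-truth is the rest of its class. The dictionary layout below is an assumption of ours, not the authors' scripts:

```python
# Sketch (assumed data layout, not the authors' evaluation scripts):
# turn a binary class labelling into standard queries with ground-truth.
# `class_of` maps an image id to its class label, or None for embedding items.
from collections import defaultdict

def queries_with_ground_truth(class_of: dict) -> dict:
    members = defaultdict(list)
    for image_id, label in class_of.items():
        if label is not None:
            members[label].append(image_id)
    queries = {}
    for label, ids in members.items():
        if len(ids) < 2:   # a query needs at least one other relevant item
            continue
        for query_id in ids:
            # ground-truth: the remaining members of the same class
            queries[query_id] = [i for i in ids if i != query_id]
    return queries
```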
2.3. The making of ground-truth for the user queries

The two problems we considered to generate queries are:
• Is there a (noisy) duplicate of this portrait?
• Are there other images with this (noisy) studio logo?

Part of this ground-truth exercise is described in [4]. Our database, the Leiden 19th-Century Portrait Database (LCPD) at http://nies.liacs.nl:1860, consists of Dutch studio portraiture, so-called cartes de visite, which were produced between 1860 and 1914 by the millions. Customers were usually provided with a dozen copies of their portrait. At the back of the portraits, a studio logo is often present. Over the years, different keeping conditions of the cartes distributed among relatives and friends gave rise to differences due to bleaching, staining, and/or annotation (names, dates, collection numbers) between original copy sets of portraits and/or logos. These effects were considered additive noise in our model of the database as a multi-class image collection. As a consequence, none of the scanned images is a digital copy of any other image.

Since duplicates and logos mostly come from the same studio, a first division of the portrait database was made based on textual information about the studio, town, and address, resulting in 3,650 studio classes. After this step, the finding of duplicates and identical logo classes is restricted to the detection of duplicates and the grouping of different logos into clusters within each of the 3,650 studio entries. That way, 238 (noisy) duplicate pairs for the portraits and 1,856 (noisy) logo classes with 2 to 300 members (average size 8) could be formed efficiently by going through the image database a second time (studio by studio) for 42,000 images in all. In this way, 15,324 image queries with ground-truth answers have been generated.

3. METHODS: DIFFERENCES IN INPUT PREPARATION, INDEXES, AND RETRIEVAL

The following procedure describes how portraits were digitized and preprocessed before feature vectors were formed. The original portraits were scanned at 300 dpi and downsampled by averaging to create digital input at a range of resolutions; this resolution in dpi is one of the parameters of a CBIR method. Because most of the feature vectors we wanted to extract are sensitive to scale, rotation, and translation, all digitized images were made invariant to these geometric changes by using a uniform resolution, a standard orientation, and a standard cropping procedure for all scanned images. All images were also made invariant to some of the lighting effects by contrast stretching. The images were then ready for feature extraction based on the intensity domain. For those feature vectors obtained from the gradient or binarized gradient domain, the Sobel 3x3 gradient magnitude image was formed and thresholded into binary images where needed. A sketch of this input-preparation phase is given after the method list below.

For all methods compared here, images underwent the same input preparation phase; the main differences are the way feature vectors are formed during the index phase:
• RANDOM: input preparation,
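The following sketch illustrates the input-preparation phase described in Section 3 (averaging downsample, contrast stretching, Sobel 3x3 gradient magnitude, and binarization); the downsampling factor and threshold values are illustrative assumptions, not the parameter settings used in the paper:

```python
# Sketch of the input-preparation steps: gray-level input, averaging
# downsample, contrast stretching, Sobel 3x3 gradient magnitude, and
# thresholding. Parameter values here are illustrative only.
import numpy as np
from scipy import ndimage

def downsample_by_averaging(img: np.ndarray, factor: int) -> np.ndarray:
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    h = (img.shape[0] // factor) * factor
    w = (img.shape[1] // factor) * factor
    blocks = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def contrast_stretch(img: np.ndarray) -> np.ndarray:
    """Linearly rescale intensities to the full [0, 255] range."""
    lo, hi = img.min(), img.max()
    return (img - lo) / max(hi - lo, 1e-9) * 255.0

def gradient_domain(img: np.ndarray, threshold: float = 50.0):
    """Sobel 3x3 gradient magnitude and its binarized version."""
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    magnitude = np.hypot(gx, gy)
    binary = magnitude > threshold   # the threshold is itself a method parameter
    return magnitude, binary

# Usage (hypothetical 300 dpi scan, reduced by a factor of 4):
# prepared = contrast_stretch(downsample_by_averaging(scan, factor=4))
# grad, edges = gradient_domain(prepared)
```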
REFERENCES

[1] Michael S. Lew et al., "2D Pixel Trigrams for Content-Based Image Retrieval," Image Databases and Multi-Media Search, 1998.
[2] Michael S. Lew et al., "Quality Measures for Interactive Image Retrieval with a Performance Evaluation of Two 3x3 Texel-based Methods," ICIAP, 1997.
[3] Matti Pietikäinen et al., "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition, 1996.
[4] Alexander Dekhtyar et al., "Information Retrieval," Lecture Notes in Computer Science, 2018.
[5] Nicu Sebe et al., "A Ground-Truth Training Set for Hierarchical Clustering in Content-Based Image Retrieval," VISUAL, 2000.
[6] Nicu Sebe et al., "Extended performance graphs for cluster retrieval," Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
[7] Thierry Pun et al., "Performance evaluation in content-based image retrieval: overview and proposals," Pattern Recognition Letters, 2001.
[8] Nicu Sebe et al., "Toward Improved Ranking Metrics," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
[9] Michael S. Lew et al., "Efficient content-based image retrieval in digital picture collections using projections: (near)-copy location," Proceedings of the 13th International Conference on Pattern Recognition, 1996.
[10] Gerard Salton et al., "The 'generality' effect and the retrieval evaluation for large collections," Journal of the American Society for Information Science, 1972.