CoPhIR Image Collection under the Microscope

The Content-based Photo Image Retrieval (CoPhIR) dataset is the largest available database of digital images with corresponding visual descriptors. It contains five MPEG-7 global descriptors extracted from more than 106 million images from the Flickr photo-sharing system. In this paper, we analyze this dataset with a focus on 1) the efficiency of similarity-based indexing and searching and 2) the expressiveness of a combination of the descriptors with respect to the subjective perception of visual similarity. We treat the individual descriptors as metric spaces and combine them into a multi-metric space. We analyze the distance distributions of the individual descriptors, measure the intrinsic dimensionality of these datasets, and statistically evaluate the correlation between the descriptors. Further, we use two methods to assess the subjective accuracy of, and user satisfaction with, similarity retrieval based on the combination of descriptors recommended for CoPhIR, and we compare these results on databases of 10 and 100 million CoPhIR images. Finally, we suggest, explore, and evaluate two approaches to improving the accuracy: 1) applying logarithms to weaken the influence of a single descriptor's contribution when it deviates from the rest, and 2) categorizing the dataset and identifying the visual characteristics important for individual categories.
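As a rough illustration of two ideas mentioned above, the following Python sketch estimates intrinsic dimensionality from a sample of pairwise distances and compares a plain weighted sum of descriptor distances with a logarithmically damped one. It is a minimal sketch under stated assumptions, not the paper's implementation: the estimator mu^2 / (2 * sigma^2) is one commonly used definition for metric spaces, and the abstract does not say which estimator the authors use; the damping function log(1 + d), the function names, the weights, and the sample distances are all hypothetical, and the per-descriptor distances are assumed to be already normalized to comparable ranges.

```python
import math
from statistics import mean, pvariance


def intrinsic_dimensionality(distances):
    """Estimate intrinsic dimensionality as mu^2 / (2 * sigma^2).

    `distances` is a sample of pairwise distances from one metric space.
    Assumption: this estimator is used for illustration only.
    """
    mu = mean(distances)
    var = pvariance(distances, mu)  # population variance of the sample
    if var == 0:
        raise ValueError("all distances are equal; estimator is undefined")
    return mu * mu / (2.0 * var)


def weighted_sum(dists, weights):
    """Combine per-descriptor distances into one multi-metric distance."""
    return sum(w * d for w, d in zip(weights, dists))


def log_damped_sum(dists, weights):
    """Weighted sum of log(1 + d) instead of d.

    Assumption: log(1 + d) stands in for "applying logarithms". It is
    increasing, concave, and zero at d = 0, so it keeps each component a
    metric while compressing large (deviating) distances.
    """
    return sum(w * math.log1p(d) for w, d in zip(weights, dists))


if __name__ == "__main__":
    # Hypothetical distances for five descriptors; the last one deviates.
    dists = [0.2, 0.25, 0.2, 0.3, 3.0]
    weights = [1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical equal weights

    plain = weighted_sum(dists, weights)
    damped = log_damped_sum(dists, weights)

    # Share of the total contributed by the deviating descriptor.
    print("plain share: ", weights[-1] * dists[-1] / plain)
    print("damped share:", weights[-1] * math.log1p(dists[-1]) / damped)
    print("intrinsic dimensionality:", intrinsic_dimensionality(dists))
```

In this sketch the deviating descriptor accounts for a smaller share of the damped sum than of the plain sum, which is the intended effect of the transformation. Choosing a concave function that is zero at zero is a deliberate design choice: it preserves the triangle inequality of each component, so metric indexing techniques remain applicable.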
