Enabling Content-Based Image Retrieval in Very Large Digital Libraries

Enabling effective and efficient Content-Based Image Retrieval (CBIR) on Very Large Digital Libraries (VLDLs) is today an important research issue. While there are well-known approaches to information retrieval on textual content for VLDLs, the search for an effective CBIR method that also scales to very large collections is still open. A practical effect of this situation is that most of the image retrieval services currently available for VLDLs are based only on textual metadata. In this paper, we report on our experience in creating a collection of 106 million images, the CoPhIR collection, the largest currently available to the scientific community for research purposes. We discuss the various issues that arise from working with such a large collection and from dealing with a complex retrieval model on information-rich features. We present the non-trivial process of image crawling and descriptive feature extraction, carried out on the European EGEE computing Grid. The feature extraction phase is often ignored when scalability is discussed, yet, as we show in this work, it can be one of the toughest problems to solve in order to make CBIR feasible on VLDLs.
