Caching content-based queries for robust and efficient image retrieval

In order to become an effective complement to traditional Web-scale text-based image retrieval solutions, content-based image retrieval must address scalability and efficiency issues. In this paper we investigate the possibility of caching the answers to content-based image retrieval queries in metric space, with the aim of reducing the average cost of query processing, and boosting the overall system throughput. Our proposal exploits the similarity between the query object and the cache content, and allows the cache to return approximate answers with acceptable quality guarantee even if the query processed has never been encountered in the past. Moreover, since popular images that are likely to be used as query have several near-duplicate versions, we show that our caching algorithm is robust, and does not suffer of cache pollution problems due to near-duplicate query objects. We report on very promising results obtained with a collection of one million high-quality digital photos. We show that it is worth pursuing caching strategies also in similarity search systems, since the proposed caching techniques can have a significant impact on performance, like caching on text queries has been proven effective for traditional Web search engines.

[1]  Hanan Samet,et al.  Foundations of multidimensional and metric data structures , 2006, Morgan Kaufmann series in data management systems.

[2]  Roelof van Zwol,et al.  Flickr: Who is Looking? , 2007, Web Intelligence.

[3]  Fabrizio Silvestri,et al.  Boosting the performance of Web search engines: Caching and prefetching query results by exploiting historical usage data , 2006, TOIS.

[4]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[5]  Hanan Samet,et al.  Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling) , 2005 .

[6]  Shlomo Moran,et al.  Predictive caching and prefetching of query results in search engines , 2003, WWW '03.

[7]  Z. Meral Özsoyoglu,et al.  Indexing large metric spaces for similarity search queries , 1999, TODS.

[8]  Divyakant Agrawal,et al.  Approximate nearest neighbor searching in multimedia databases , 2001, Proceedings 17th International Conference on Data Engineering.

[9]  Evangelos P. Markatos,et al.  On caching search engine query results , 2001, Comput. Commun..

[10]  Christian Böhm,et al.  Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases , 2001, CSUR.

[11]  Hanan Samet,et al.  Index-driven similarity search in metric spaces (Survey Article) , 2003, TODS.

[12]  Ricardo A. Baeza-Yates,et al.  Searching in metric spaces , 2001, CSUR.

[13]  Justin Zobel,et al.  Detection of near-duplicate images for web search , 2007, CIVR '07.

[14]  Pavel Zezula,et al.  Similarity Search - The Metric Space Approach , 2005, Advances in Database Systems.

[15]  Yinglian Xie,et al.  Locality in search engine queries and its implications for caching , 2002, Proceedings.Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies.

[16]  Pavel Zezula,et al.  Similarity Search: The Metric Space Approach (Advances in Database Systems) , 2005 .

[17]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[18]  B. S. Manjunath,et al.  Introduction to MPEG-7: Multimedia Content Description Interface , 2002 .

[19]  Monika Henzinger,et al.  Analysis of a very large web search engine query log , 1999, SIGF.

[20]  James Ze Wang,et al.  Image retrieval: Ideas, influences, and trends of the new age , 2008, CSUR.

[21]  Salvatore Orlando,et al.  A metric cache for similarity search , 2008, LSDS-IR '08.