Active caching for similarity queries based on shared-neighbor information

Novel applications such as recommender systems, uncertain databases, and multimedia databases are designed to process similarity queries that produce ranked lists of objects as their results. Such queries typically incur disk access latency and a substantial computational cost. In this paper, we propose an 'active caching' technique for similarity queries that can synthesize query results from cached information even when the required result list is not explicitly stored in the cache. Our solution, the Cache Estimated Significance (CES) model, is based on shared-neighbor similarity measures, which assess the strength of the relationship between two objects as a function of the number of other objects in the intersection of their neighborhoods. The proposed method is general in that it does not require that the features be drawn from a metric space, nor does it require that the partial orders induced by the similarity measure be monotonic. Experimental results on real data sets show a substantially higher cache hit rate than traditional caching approaches.
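
To make the notion of shared-neighbor similarity concrete, the following minimal Python sketch scores two objects by the overlap of their neighbor lists. It is an illustration only and not the CES model itself: the function names, the cosine-style normalization, and the toy `cached` dictionary are assumptions introduced here, not details taken from the paper.

```python
from math import sqrt


def shared_neighbor_count(neighbors_a, neighbors_b):
    """Number of objects that appear in both neighbor lists."""
    return len(set(neighbors_a) & set(neighbors_b))


def shared_neighbor_similarity(neighbors_a, neighbors_b):
    """Overlap normalized to [0, 1] (cosine-style; an assumed choice).

    Only set membership is used, so no metric-space property of the
    underlying features is needed.
    """
    a, b = set(neighbors_a), set(neighbors_b)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical cached result lists, keyed by query object.
cached = {
    "q1": ["x1", "x2", "x3", "x4"],
    "q2": ["x3", "x4", "x5", "x6"],
    "q3": ["x7", "x8", "x9", "x10"],
}

print(shared_neighbor_similarity(cached["q1"], cached["q2"]))
print(shared_neighbor_similarity(cached["q1"], cached["q3"]))
```

In this toy example, `q1` and `q2` share two of their four neighbors, while `q1` and `q3` share none, so the first pair is judged more strongly related. A cache built on such a measure could, in principle, use related cached lists as evidence for a query whose own result list is not stored.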
