Efficient online top-K retrieval with arbitrary similarity measures

The top-k retrieval problem requires finding the k objects most similar to a given query object. Similarities between objects are most often computed by aggregating the similarities of their attribute values. We consider the case where the similarities between attribute values are arbitrary (non-metric), so standard space-partitioning indexes cannot be used. Among the most popular techniques that can handle arbitrary similarity measures is the family of threshold algorithms. These were designed as middleware algorithms: they assume that a similarity list for each attribute is available and focus on merging these lists efficiently to arrive at the results. In this paper, we explore multi-dimensional indexing of non-metric spaces that exploits inter-attribute relationships to prune the search space efficiently during top-k computation. We propose an indexing structure, the AL-Tree, and an algorithm that uses it to perform top-k retrieval in an online fashion. The AL-Tree exploits the fact that many real-world attributes come from a small value space. We show that our algorithm performs much better than the threshold-based algorithms in terms of computational cost, owing to efficient pruning of the search space. Further, it outperforms them in terms of I/Os by up to an order of magnitude on dense datasets.
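For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a threshold-style list merge. It is not the AL-Tree or the algorithm proposed here. It assumes that aggregation is a plain sum, that every per-attribute list ranks every object, and that random access to any object's similarity is available; the function name `threshold_topk` and the toy data are hypothetical and chosen only for illustration.

```python
import heapq


def threshold_topk(sorted_lists, k):
    """Threshold-style top-k merge (illustrative sketch, sum aggregation assumed).

    sorted_lists: one list per attribute, holding (object_id, similarity)
    pairs sorted by similarity in descending order. Each list is assumed
    to contain every object exactly once.
    Returns (results, rounds): results is a list of (score, object_id)
    in descending score order, rounds is the number of sorted-access
    rounds that were needed.
    """
    # Random-access tables: similarity of any object on each attribute.
    lookup = [dict(lst) for lst in sorted_lists]
    seen = set()
    top = []  # min-heap holding the best k (score, object_id) pairs so far
    rounds = 0
    for depth in range(len(sorted_lists[0])):
        rounds = depth + 1
        threshold = 0.0  # best score any still-unseen object could reach
        for lst in sorted_lists:
            obj, sim = lst[depth]
            threshold += sim
            if obj not in seen:
                seen.add(obj)
                # Fetch the full aggregated score through random access.
                score = sum(table[obj] for table in lookup)
                if len(top) < k:
                    heapq.heappush(top, (score, obj))
                elif score > top[0][0]:
                    heapq.heapreplace(top, (score, obj))
        # Stop once the k-th best score is at least the threshold.
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, key=lambda t: (-t[0], t[1])), rounds


# Toy similarity lists for two attributes (made-up values).
color_list = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
shape_list = [("b", 0.9), ("a", 0.7), ("d", 0.4), ("c", 0.2)]
results, rounds = threshold_topk([color_list, shape_list], k=2)
top_ids = [obj for _, obj in results]
```

In this toy run the merge stops before reading the lists to the end, because the two best objects already score at least as high as anything that could still appear. Note that the pruning here relies only on list order; it uses no relationship between attributes, which is the gap the indexing approach in this paper targets.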
