Efficient top-k similarity join processing over multi-valued objects

Top-k similarity joins have been extensively studied and are used in a wide spectrum of applications, such as information retrieval, decision making, spatial data analysis, and data mining. Given two sets of objects $\mathcal U$ and $\mathcal V$, a top-k similarity join returns the k most similar pairs of objects from $\mathcal U \times \mathcal V$. In the conventional model of top-k similarity join processing, an object is usually regarded as a point in a multi-dimensional space, and similarity is measured by a simple distance metric such as Euclidean distance. However, in many applications an object is described by multiple values (instances), and the conventional model is not applicable because it does not account for the distribution of an object's instances. In this paper, we study top-k similarity joins over multi-valued objects. We apply two types of quantile-based distance measures, ϕ-quantile distance and ϕ-quantile group-base distance, to capture the relative distribution of the instances of the objects. Following a filtering-refinement framework, we develop efficient and effective techniques to process top-k similarity joins over multi-valued objects, and we propose novel distance-, statistic-, and weight-based pruning techniques. Comprehensive experiments on both real and synthetic datasets demonstrate the efficiency and effectiveness of our techniques.
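To make the problem statement concrete, here is a minimal brute-force sketch of a top-k similarity join over multi-valued objects. The abstract does not define the ϕ-quantile distance, so the version below is an assumption: each object is a list of weighted instances whose weights sum to 1, every instance pair gets the product of its two weights, and the distance is the smallest pairwise Euclidean distance at which the cumulative weight reaches ϕ. The names `quantile_distance` and `top_k_join` are hypothetical, and the sketch performs no filtering or pruning; it only illustrates what the join returns.

```python
import heapq
import math
from itertools import product


def quantile_distance(u, v, phi):
    """Assumed phi-quantile distance between two multi-valued objects.

    u and v are lists of (point, weight) instances; the weights of each
    object are assumed to sum to 1. This definition is an illustration,
    not necessarily the one used in the paper.
    """
    # All instance pairs, sorted by Euclidean distance, each carrying
    # the product of the two instance weights.
    pairs = sorted(
        (math.dist(p, q), wp * wq) for (p, wp), (q, wq) in product(u, v)
    )
    cumulative = 0.0
    for dist, weight in pairs:
        cumulative += weight
        # Small tolerance guards against floating-point round-off.
        if cumulative >= phi - 1e-12:
            return dist
    return pairs[-1][0]


def top_k_join(U, V, k, phi):
    """Brute-force top-k join: the k pairs (distance, i, j) with the
    smallest assumed phi-quantile distance over U x V."""
    scored = (
        (quantile_distance(u, v, phi), i, j)
        for (i, u), (j, v) in product(enumerate(U), enumerate(V))
    )
    return heapq.nsmallest(k, scored)
```

For example, with `U = [[((0, 0), 0.5), ((1, 0), 0.5)], [((10, 10), 1.0)]]` and `V = [[((0, 0), 1.0)], [((10, 11), 1.0)]]`, calling `top_k_join(U, V, 2, 0.5)` evaluates all four object pairs and keeps the two closest. A filtering-refinement approach would aim to avoid computing exact distances for most pairs.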
