On Gapped Set Intersection Size Estimation

There exists considerable literature on estimating the cardinality of set intersection results. In this paper, we consider a generalized problem for integer sets where, given a gap parameter δ, two elements are deemed matches if their numeric difference equals δ or is within δ. We call this problem gapped set intersection size estimation (GSISE); it can be used to model applications in database systems, data mining, and information retrieval. We first distinguish two subtypes of the estimation problem: point gap estimation and range gap estimation. We propose optimized sketches to tackle the two problems efficiently and effectively with theoretical guarantees. We demonstrate the use of our proposed techniques in mining top-K related keywords efficiently by integrating them with an inverted index. Finally, extensive experiments on a large subset of the ClueWeb09 dataset demonstrate the efficiency and effectiveness of the proposed methods.
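To make the two subtypes concrete, the following minimal sketch computes the exact quantities that the sketches would estimate. It is not the paper's method: it is a brute-force baseline under stated assumptions. The function names `point_gap_size` and `range_gap_size` are hypothetical, the result is assumed to be the number of matching pairs (x, y) with x from the first set and y from the second, and the difference is assumed to be absolute, so that point gap means |x - y| = δ and range gap means |x - y| ≤ δ. The abstract does not fix these details.

```python
from bisect import bisect_left, bisect_right


def point_gap_size(a, b, delta):
    """Count pairs (x, y), x in a, y in b, with |x - y| == delta.

    Assumption: matches are counted as pairs and the gap is symmetric.
    """
    b_set = set(b)
    # Using a set of offsets avoids double counting when delta == 0.
    offsets = {delta, -delta}
    return sum(1 for x in set(a) for d in offsets if x + d in b_set)


def range_gap_size(a, b, delta):
    """Count pairs (x, y), x in a, y in b, with |x - y| <= delta.

    Assumption: same pair-counting convention as point_gap_size.
    """
    b_sorted = sorted(set(b))
    total = 0
    for x in set(a):
        # Elements of b inside the closed window [x - delta, x + delta].
        lo = bisect_left(b_sorted, x - delta)
        hi = bisect_right(b_sorted, x + delta)
        total += hi - lo
    return total


if __name__ == "__main__":
    a = {1, 5, 10}
    b = {3, 6, 12}
    print(point_gap_size(a, b, 2))
    print(range_gap_size(a, b, 2))
```

With δ = 0 both functions reduce to the ordinary set intersection size, which is the sense in which the gapped problem generalizes the classical one.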