C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join

Similarity join of two datasets $P$ and $Q$ is a primitive operation that is useful in many application domains. The operation involves identifying pairs $(p,q)$ in the Cartesian product of $P$ and $Q$ such that $(p,q)$ satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution while reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost estimation frequently turns into a bottleneck in a distributed processing environment, which makes it challenging to achieve a faster and more efficient similarity join. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) minimum spanning tree for organizing LSH bucket replication; and (ii) spectral clustering for runtime collision counting task scheduling. Experiments show that, in comparison to the state of the art, the proposed solution achieves a 20 percent data reduction and a 50 percent reduction in shuffle time.
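For readers unfamiliar with collision counting, the following is a minimal single-machine sketch of the general idea, not the C2Net algorithm itself. It assumes Euclidean data, p-stable (Gaussian) hash functions of the form floor((a·v + b) / w), and hypothetical parameter names (`m`, `w`, `threshold`) that do not come from the paper. A pair $(p,q)$ is reported when it falls into the same bucket under at least `threshold` of the `m` hash functions. In a MapReduce setting, the per-bucket pair enumeration would roughly correspond to the map side and the summation of counts to combiners and reducers.

```python
import math
import random
from collections import defaultdict
from itertools import product


def make_hashes(dim, m, w, seed=0):
    """Draw m hash functions (a, b): a is a Gaussian vector, b is uniform in [0, w)."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(m)
    ]


def bucket(point, a, b, w):
    """Bucket id of a point under one hash function: floor((a . point + b) / w)."""
    projection = sum(x * y for x, y in zip(a, point))
    return math.floor((projection + b) / w)


def collision_counting_join(P, Q, m=20, w=4.0, threshold=12, seed=0):
    """Return index pairs (i, j) whose points collide under >= threshold hashes."""
    hashes = make_hashes(len(P[0]), m, w, seed)
    counts = defaultdict(int)  # (i, j) -> number of hash functions with a collision
    for a, b in hashes:
        # Group the indices of P and Q by bucket id for this hash function.
        buckets = defaultdict(lambda: ([], []))
        for i, p in enumerate(P):
            buckets[bucket(p, a, b, w)][0].append(i)
        for j, q in enumerate(Q):
            buckets[bucket(q, a, b, w)][1].append(j)
        # Every P-Q pair sharing a bucket counts as one collision.
        for p_ids, q_ids in buckets.values():
            for pair in product(p_ids, q_ids):
                counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= threshold}


P = [(0.0, 0.0), (10.0, 10.0)]
Q = [(0.0, 0.0), (50.0, -50.0)]
result = collision_counting_join(P, Q)
```

Identical points collide under every hash function, so the pair `(0, 0)` is always reported here; whether more distant pairs appear depends on the random hash functions and on the chosen `w` and `threshold`.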
