Similarity joins for high‐dimensional data using Spark

Similarity join on high‐dimensional data is a primitive operation: given a data set and a distance measure, it finds all pairs of data points whose distance is no more than ϵ. As the scale and dimensionality of the data set increase, the computation cost grows rapidly. Hadoop and Spark have become popular platforms for big‐data analysis. Because Spark has native advantages in iterative computation, we adopt it as our platform for performing similarity joins on high‐dimensional data sets. To resolve problems in existing work such as data imbalance, data duplication, and redundant computation, we propose a new algorithm based on symbolic aggregation and vertical decomposition. We first reduce dimensionality using a symbolic aggregation method. We then apply a vertical partitioning operation to the processed data. The join operations are performed on each vertical partition in parallel, and the proposed new filters are used to prune false positives at an early stage. Finally, the partial results generated from each partition are aggregated and verified to obtain the final results. The proposed algorithm can significantly improve the efficiency of similarity joins on high‐dimensional data. To verify the efficiency and scalability of our method, we implemented it using MapReduce and Spark. We compared our method with existing work on public data sets, and the experimental results show that the new method is more efficient and scalable under different running environments.
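
The filter-then-verify pipeline described above can be illustrated with a small single-machine sketch. This is not the paper's algorithm: it assumes Euclidean distance, uses plain segment means (piecewise aggregate approximation) in place of symbolic codes, and runs sequentially rather than on Spark. The names `paa`, `vertical_partitions`, and `similarity_join`, and the parameters `segments` and `parts`, are hypothetical, and the vector dimension is assumed to be divisible by both `segments` and `parts`.

```python
from itertools import combinations


def paa(vec, segments):
    """Reduce a vector to the mean of each equal-width segment."""
    width = len(vec) // segments
    return [sum(vec[s * width:(s + 1) * width]) / width for s in range(segments)]


def vertical_partitions(dim, parts):
    """Split the dimension indices 0..dim-1 into equal-width column groups."""
    size = dim // parts
    return [range(p * size, (p + 1) * size) for p in range(parts)]


def similarity_join(data, eps, segments=2, parts=2):
    """Return all index pairs (i, j), i < j, with Euclidean distance <= eps."""
    dim = len(data[0])
    width = dim // segments
    reduced = [paa(v, segments) for v in data]
    columns = vertical_partitions(dim, parts)
    eps_sq = eps * eps
    result = []
    for i, j in combinations(range(len(data)), 2):
        # Filter 1: width * squared distance between segment means never
        # exceeds the true squared distance, so it can only prune true negatives.
        lower_sq = width * sum((a - b) ** 2 for a, b in zip(reduced[i], reduced[j]))
        if lower_sq > eps_sq:
            continue
        # Filter 2 and verification: partial squared distances from each
        # vertical partition are summed; stop as soon as the sum exceeds eps^2.
        total = 0.0
        for cols in columns:
            total += sum((data[i][c] - data[j][c]) ** 2 for c in cols)
            if total > eps_sq:
                break
        else:
            result.append((i, j))
    return result


if __name__ == "__main__":
    points = [[0, 0, 0, 0], [1, 0, 0, 0], [5, 5, 5, 5], [5, 5, 5, 6]]
    print(similarity_join(points, eps=1.0))
```

Both filters are lower bounds on the true squared distance, so they discard candidate pairs early without losing any true result; in a distributed setting the per-partition partial sums would be computed in parallel and then aggregated.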
