Parallel similarity joins on massive high‐dimensional data using MapReduce

In this paper, we focus on high‐dimensional similarity join (HDSJ) using MapReduce paradigm. As the volume of the data and the number of the dimensions increase, the computation cost of HDSJ will increase exponentially. There is no existing effective approach that can process HDSJ efficiently, so we propose a novel method called symbolic aggregate approximation (SAX)‐based HDSJ to deal with the problem. SAX is the abbreviation of symbolic aggregate approximation that is a dimensionality reduction technique and widely used in time series processing, we use SAX to represent the high‐dimensional vectors in this paper and reorganize these vectors into groups based on their SAX representations. For the very high‐dimensional vectors, we also propose an improved SAX‐based HDSJ approach. Finally, we implement SAX‐based HDSJ and improved SAX‐based HDSJ on Hadoop‐0.20.2 and perform comprehensive experiments to test the performance, we also compare SAX‐based HDSJ and improved SAX‐based HDSJ with the existing method. The experiment results show that our proposed approaches have much better performance than that of the existing method. Copyright © 2015 John Wiley & Sons, Ltd.

[1]  Eamonn J. Keogh,et al.  Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases , 2001, Knowledge and Information Systems.

[2]  Christian Böhm,et al.  High performance clustering based on the similarity join , 2000, CIKM '00.

[3]  Chen Li,et al.  Efficient parallel set-similarity joins using MapReduce , 2010, SIGMOD Conference.

[4]  Christos Faloutsos,et al.  V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors , 2012, Proc. VLDB Endow..

[5]  Aditya G. Parameswaran,et al.  Fuzzy Joins Using MapReduce , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[6]  Feifei Li,et al.  K nearest neighbor queries and kNN-Joins in large relational databases (almost) for free , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[7]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[8]  Jeffrey D. Ullman,et al.  Optimizing joins in a map-reduce environment , 2010, EDBT '10.

[9]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[10]  Beng Chin Ooi,et al.  Efficient Processing of k Nearest Neighbor Joins using MapReduce , 2012, Proc. VLDB Endow..

[11]  Eamonn J. Keogh,et al.  A symbolic representation of time series, with implications for streaming algorithms , 2003, DMKD '03.

[12]  Marios Hadjieleftheriou,et al.  Efficient trajectory joins using symbolic representations , 2005, MDM '05.

[13]  Mirek Riedewald,et al.  Processing theta-joins using MapReduce , 2011, SIGMOD '11.

[14]  Jimmy J. Lin,et al.  Pairwise Document Similarity in Large Collections with MapReduce , 2008, ACL.

[15]  Younghoon Kim,et al.  Parallel Top-K Similarity Join Algorithms Using MapReduce , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[16]  Feifei Li,et al.  Efficient parallel kNN joins for large data in MapReduce , 2012, EDBT '12.

[17]  Min Wang,et al.  Efficient Multi-way Theta-Join Processing Using MapReduce , 2012, Proc. VLDB Endow..

[18]  Vagelis Hristidis,et al.  Authority-based keyword search in databases , 2008, TODS.

[19]  Kyuseok Shim,et al.  High-dimensional similarity joins , 1997, Proceedings 13th International Conference on Data Engineering.

[20]  Hanan Samet,et al.  Metric space similarity joins , 2008, TODS.

[21]  Christian Böhm,et al.  Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data , 2001, SIGMOD '01.

[22]  Ranieri Baraglia,et al.  Document Similarity Self-Join with MapReduce , 2010, 2010 IEEE International Conference on Data Mining.

[23]  Lionel M. Ni,et al.  Efficient Similarity Joins on Massive High-Dimensional Datasets Using MapReduce , 2012, 2012 IEEE 13th International Conference on Mobile Data Management.

[24]  Jignesh M. Patel,et al.  A comparison of join algorithms for log processing in MaPreduce , 2010, SIGMOD Conference.

[25]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[26]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.