Fast, incremental, and scalable all pairs similarity search

Searching for pairs of similar data records is an operation required by many data mining techniques, such as clustering and collaborative filtering. With the emergence of the Web, the scale of data has grown to millions or billions of records. Business and scientific applications such as search engines, digital libraries, and systems biology often deal with massive datasets in a high dimensional space. The overarching goal of this dissertation is to enable fast and incremental similarity search over large high dimensional datasets through improved indexing, systematic heuristic optimizations, and scalable parallelization. In Task 1, we design a sequential algorithm for All Pairs Similarity Search (APSS), the problem of finding all pairs of records whose similarity is above a specified threshold. Our proposed fast matching technique speeds up APSS computation by using novel tighter bounds for similarity computation and an indexing data structure. It offers the fastest solution known to date, with up to 6X speed-up over the existing state-of-the-art APSS algorithm. In Task 2, we address the incremental formulation of the APSS problem, where APSS is performed multiple times over a given dataset while varying the similarity threshold. Our goal is to avoid redundant computations across multiple invocations of APSS by storing the computation history of each APSS run. Depending on the similarity threshold variation, our proposed history binning and index splitting techniques achieve speed-ups from 2X to over 105X over the state-of-the-art APSS algorithm. To the best of our knowledge, this is the first work that addresses this problem. In Task 3, we design scalable parallel algorithms for APSS that take advantage of modern multi-processor, multi-core architectures to further scale up the APSS computation. Our proposed index sharing technique divides the APSS computation into independent tasks and achieves ideal strong scaling behavior on shared memory architectures. We also propose a complementary incremental index sharing technique, which provides a memory-efficient parallel APSS solution while maintaining almost linear speed-up. The performance of our parallel APSS algorithms remains consistent for datasets of various sizes. To the best of our knowledge, this is the first work that explores parallelization for APSS. We demonstrate the effectiveness of our techniques using four record datasets.
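
To make the APSS problem statement concrete, the following is a minimal sketch of a plain inverted-index baseline, assuming sparse records and cosine similarity. It is not the fast matching technique of Task 1, and it applies none of the bounds, history binning, or index sharing described above. The names `normalize` and `all_pairs` and the dictionary-based record format are illustrative assumptions, not taken from the dissertation.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm else {}


def all_pairs(records, threshold):
    """Return {(j, i): cosine} for every pair j < i with cosine >= threshold.

    Hypothetical baseline: each record is matched against an inverted index
    of the records seen so far, then added to that index.
    """
    index = defaultdict(list)  # feature -> [(record id, normalized weight)]
    result = {}
    for i, raw in enumerate(records):
        vec = normalize(raw)
        # Accumulate partial dot products with earlier records that share
        # at least one feature with the current record.
        scores = defaultdict(float)
        for f, w in vec.items():
            for j, wj in index[f]:
                scores[j] += w * wj
        # Keep only the pairs that meet the similarity threshold.
        for j, s in scores.items():
            if s >= threshold:
                result[(j, i)] = s
        # Index the current record so later records can match against it.
        for f, w in vec.items():
            index[f].append((i, w))
    return result


if __name__ == "__main__":
    records = [{"a": 1, "b": 1}, {"a": 1, "b": 1, "c": 1}, {"c": 1, "d": 1}]
    print(all_pairs(records, 0.5))
```

In this baseline, every call recomputes all candidate scores from scratch, so rerunning it with a different threshold repeats the same work. That redundancy is what the incremental formulation in Task 2 is meant to avoid.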
