Cluster-and-Conquer: When Randomness Meets Graph Locality

K-Nearest-Neighbors (KNN) graphs are central to many emblematic data mining and machine-learning applications. Some of the most efficient KNN graph algorithms are incremental and local: they start from a random graph, which they incrementally improve by traversing neighbors-of-neighbors links. Paradoxically, this random start is also one of the key weaknesses of these algorithms: nodes are initially connected to dissimilar neighbors that lie far away according to the similarity metric. As a result, incremental algorithms must first laboriously explore spurious potential neighbors before they can identify similar nodes and start converging. In this paper, we remove this drawback with Cluster-and-Conquer (C² for short). Cluster-and-Conquer boosts the starting configuration of greedy algorithms thanks to a novel lightweight clustering mechanism, dubbed FastRandomHash. FastRandomHash leverages randomness and recursion to pre-cluster similar nodes at a very low cost. Our extensive evaluation on real datasets shows that Cluster-and-Conquer significantly outperforms existing approaches, including LSH, yielding speed-ups of up to ×4.42 while incurring only a negligible loss in terms of KNN quality.
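
The incremental, local scheme described above (random start, then refinement over neighbors-of-neighbors links) can be illustrated with a minimal sketch. This is not the paper's implementation and does not include FastRandomHash, whose details are not given here. The function names `jaccard` and `greedy_knn`, the choice of Jaccard similarity over item sets, and the parameters `seed` and `max_rounds` are all assumptions made for illustration.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_knn(profiles, k, seed=0, max_rounds=20):
    """Illustrative greedy KNN graph construction (hypothetical helper).

    profiles maps each node id to a set of items; k must be < len(profiles).
    """
    rng = random.Random(seed)
    nodes = list(profiles)
    # Random start: every node gets k arbitrary neighbors other than itself.
    graph = {u: set(rng.sample([v for v in nodes if v != u], k)) for u in nodes}
    for _ in range(max_rounds):
        changed = False
        for u in nodes:
            # Candidates: current neighbors plus the neighbors of those neighbors.
            candidates = set(graph[u])
            for v in graph[u]:
                candidates |= graph[v]
            candidates.discard(u)
            # Keep the k most similar candidates; ties are broken by node id.
            ranked = sorted(
                candidates,
                key=lambda v: (-jaccard(profiles[u], profiles[v]), v),
            )
            best = set(ranked[:k])
            if best != graph[u]:
                graph[u] = best
                changed = True
        if not changed:  # No neighborhood changed during this round: stop.
            break
    return graph
```

Because the initial neighbors are drawn uniformly at random, the first rounds mostly compare dissimilar nodes, which is the cost that a better starting configuration is meant to reduce.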
