Fingerprinting Big Data: The Case of KNN Graph Construction

We propose fingerprinting, a new technique that constructs compact, fast-to-compute, and privacy-preserving binary representations of datasets. We illustrate its effectiveness on the emblematic big data problem of K-Nearest-Neighbor (KNN) graph construction and show that fingerprinting can drastically accelerate a wide range of existing KNN algorithms while efficiently obfuscating the original data, with little to no overhead. Our extensive evaluation of the resulting approach (dubbed GoldFinger) on several realistic datasets shows that it delivers speedups of up to 78.9% compared to the use of raw data, while incurring only a negligible to moderate loss in KNN quality.
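To make the idea of a compact binary representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes each user profile is a set of item identifiers, that a fingerprint is built by hashing every item to one position of a fixed-length bit array, and that similarity is then estimated as a Jaccard-style ratio of bit counts. The function names, the choice of SHA-256, and the default length of 1024 bits are all illustrative assumptions.

```python
import hashlib


def fingerprint(items, n_bits=1024):
    """Build a hypothetical n_bits-wide fingerprint, stored as a Python int.

    Assumption: each item sets exactly one bit, chosen by a hash of the item.
    """
    fp = 0
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % n_bits
        fp |= 1 << position
    return fp


def estimated_jaccard(fp_a, fp_b):
    """Estimate Jaccard similarity from two fingerprints using bit counts."""
    union_bits = bin(fp_a | fp_b).count("1")
    if union_bits == 0:
        return 0.0  # both fingerprints are empty
    return bin(fp_a & fp_b).count("1") / union_bits


def exact_jaccard(a, b):
    """Exact Jaccard similarity on the raw item sets, for comparison."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    alice = {"item-1", "item-2", "item-3", "item-4"}
    bob = {"item-3", "item-4", "item-5"}
    print("exact    :", exact_jaccard(alice, bob))
    print("estimated:", estimated_jaccard(fingerprint(alice), fingerprint(bob)))
```

Under these assumptions, comparing two profiles costs a few bitwise operations on fixed-size values regardless of profile size, and the raw item identifiers are never exchanged. Hash collisions make the estimate approximate, which is consistent with the trade-off between speed and KNN quality described above.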
