The out-of-core KNN awakens: the light side of computation force on large datasets

K-Nearest Neighbors (KNN) is a crucial tool for many applications, e.g. recommender systems, image classification and web-related applications. However, KNN is a resource greedy operation particularly for large datasets. We focus on the challenge of KNN computation over large datasets on a single commodity PC with limited memory. We propose a novel approach to compute KNN on large datasets by leveraging both disk and main memory efficiently. The main rationale of our approach is to minimize random accesses to disk, maximize sequential accesses to data and efficient usage of only the available memory.

[1]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[2]  Eli Shechtman,et al.  In defense of Nearest-Neighbor based image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Yousef Saad,et al.  Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection , 2009, J. Mach. Learn. Res..

[4]  Nikos Mamoulis,et al.  Secure kNN computation on encrypted databases , 2009, SIGMOD Conference.

[5]  Kai Li,et al.  Efficient k-nearest neighbor graph construction for generic similarity measures , 2011, WWW.

[6]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[7]  John Langford,et al.  Cover trees for nearest neighbor , 2006, ICML.

[8]  Anne-Marie Kermarrec,et al.  HyRec: leveraging browsers for scalable recommenders , 2014, Middleware.

[9]  Minsuk Kahng,et al.  MMap: Fast billion-scale graph computation on a PC via memory mapping , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[10]  Ronald E. McRoberts,et al.  Stratified estimation of forest area using satellite imagery, inventory data, and the k-Nearest Neighbors technique , 2002 .

[11]  Jure Leskovec,et al.  {SNAP Datasets}: {Stanford} Large Network Dataset Collection , 2014 .

[12]  Guy E. Blelloch,et al.  GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.

[13]  Keinosuke Fukunaga,et al.  A Branch and Bound Algorithm for Computing k-Nearest Neighbors , 1975, IEEE Transactions on Computers.

[14]  Jinha Kim,et al.  TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC , 2013, KDD.

[15]  Matthijs Douze,et al.  Searching in one billion vectors: Re-rank with source coding , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[17]  Wenguang Chen,et al.  GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning , 2015, USENIX ATC.

[18]  Anne-Marie Kermarrec,et al.  Scaling KNN computation over large graphs on a PC , 2014, Middleware.

[19]  Anne-Marie Kermarrec,et al.  WHATSUP: A Decentralized Instant News Recommender , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[20]  Willy Zwaenepoel,et al.  X-Stream: edge-centric graph processing using streaming partitions , 2013, SOSP.

[21]  Olivier Thonnard,et al.  Building k-nn graphs from large text data , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[22]  Anne-Marie Kermarrec,et al.  Privacy-Preserving Distributed Collaborative Filtering , 2014, NETYS.