K-Nearest Neighbor Search by Random Projection Forests

K-nearest neighbor (kNN) search is an important problem in data mining and knowledge discovery. Inspired by the success of tree-based and ensemble methods over the last decades, we propose a new method for kNN search, random projection forests (rpForests). rpForests finds nearest neighbors by combining multiple kNN-sensitive trees, each constructed recursively through a series of carefully chosen random projections. Experiments on a wide collection of real datasets show that our method achieves high accuracy: both the missing rate of kNNs and the discrepancy in the k-th nearest neighbor distances decay rapidly. As a tree-based method, rpForests has very low computational complexity. Its ensemble nature makes it easy to parallelize on clustered or multicore computers; the running time is expected to be nearly inversely proportional to the number of cores or machines. We give theoretical insight into rpForests by showing that the probability that neighboring points are separated by an ensemble of random projection trees decays exponentially as the ensemble size increases. Our theory can also be used to refine the choice of random projections in the growth of rpForests; experiments show that this refinement has a marked effect.
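The general idea of combining several random projection trees for kNN search can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (build_tree, rp_forest_knn), the plain Gaussian projection directions, the median split, and the leaf_size parameter are all assumptions made for illustration, and the sketch does not include the refined choice of projections described in the abstract.

```python
import numpy as np

def build_tree(X, idx, leaf_size, rng):
    """Recursively split the points indexed by idx; return a list of leaf index arrays."""
    if len(idx) <= leaf_size:
        return [idx]
    # Assumption: project onto a uniformly random unit direction.
    direction = rng.standard_normal(X.shape[1])
    direction /= np.linalg.norm(direction)
    proj = X[idx] @ direction
    # Assumption: split at the median of the projected values.
    order = np.argsort(proj)
    mid = len(idx) // 2
    left, right = idx[order[:mid]], idx[order[mid:]]
    return build_tree(X, left, leaf_size, rng) + build_tree(X, right, leaf_size, rng)

def rp_forest_knn(X, k, n_trees=10, leaf_size=30, seed=0):
    """Approximate kNN indices for every row of X from an ensemble of trees."""
    if leaf_size < 2 * k + 1:
        # Guarantees every leaf holds at least k points besides the query itself.
        raise ValueError("leaf_size must be at least 2 * k + 1")
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    candidates = [set() for _ in range(n)]
    for _ in range(n_trees):
        # Points sharing a leaf in any tree become candidate neighbors.
        for leaf in build_tree(X, np.arange(n), leaf_size, rng):
            members = leaf.tolist()
            for i in members:
                candidates[i].update(members)
    neighbors = np.empty((n, k), dtype=int)
    for i in range(n):
        candidates[i].discard(i)
        cand = np.fromiter(candidates[i], dtype=int)
        # Exact distances are computed only within the candidate set.
        dist = np.linalg.norm(X[cand] - X[i], axis=1)
        neighbors[i] = cand[np.argsort(dist)[:k]]
    return neighbors
```

In this sketch each tree is built independently of the others, which is the property that would allow the trees to be grown in parallel. A call such as rp_forest_knn(X, k=5, n_trees=20) returns an array with one row of neighbor indices per data point; adding trees enlarges each candidate set and so can only reduce the number of missed neighbors.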
