K-Nearest Neighbor Search by Random Projection Forests

K-nearest neighbor (kNN) search has wide applications in data mining, machine learning, statistics, and many other applied domains. Inspired by the success of ensemble methods and the flexibility of tree-based methodology, we propose random projection forests (rpForests) for kNN search. rpForests finds kNNs by aggregating results from an ensemble of random projection trees, each built recursively through a series of carefully chosen random projections. rpForests achieves remarkable accuracy: both the rate of missed kNNs and the discrepancy in kNN distances decay rapidly as the ensemble grows. rpForests also has very low computational complexity, and its ensemble nature makes it easy to run in parallel on multicore or clustered computers, with running time expected to be nearly inversely proportional to the number of cores or machines. We give theoretical insight by showing that the probability of neighboring points being separated by every tree in the ensemble decays exponentially as the ensemble size increases. Our theory can be used to refine the choice of random projections during tree growth, and experiments show that the effect is remarkable.
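
The procedure described in the abstract lends itself to a short illustration. Below is a minimal sketch in Python/NumPy of one way an rpForests-style search could be organized, not the authors' implementation: each tree recursively splits the data at the median of projections onto a random direction, a query descends every tree to a leaf, and the union of leaf candidates is re-ranked by exact distance. The names (RPTree, rp_forest_knn) and parameters (n_trees, leaf_size) are illustrative assumptions, and the plain random-direction splits stand in for the paper's more carefully chosen projections.

```python
# Minimal sketch of an rpForests-style kNN search; illustrative only.
import numpy as np


class RPTree:
    """One random projection tree: split points recursively along random directions."""

    def __init__(self, data, leaf_size=20, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.data = data
        self.root = self._build(np.arange(len(data)), leaf_size)

    def _build(self, indices, leaf_size):
        if len(indices) <= leaf_size:
            return {"leaf": indices}
        # Draw a random unit direction and project the points onto it.
        direction = self.rng.standard_normal(self.data.shape[1])
        direction /= np.linalg.norm(direction)
        projections = self.data[indices] @ direction
        # Split at the median projection value.
        threshold = np.median(projections)
        left = indices[projections <= threshold]
        right = indices[projections > threshold]
        if len(left) == 0 or len(right) == 0:  # degenerate split; stop here
            return {"leaf": indices}
        return {
            "direction": direction,
            "threshold": threshold,
            "left": self._build(left, leaf_size),
            "right": self._build(right, leaf_size),
        }

    def leaf_of(self, query):
        """Return the indices stored in the leaf that the query falls into."""
        node = self.root
        while "leaf" not in node:
            side = "left" if query @ node["direction"] <= node["threshold"] else "right"
            node = node[side]
        return node["leaf"]


def rp_forest_knn(data, query, k=5, n_trees=10, leaf_size=20, seed=0):
    """Aggregate leaf candidates from an ensemble of trees, then rank by exact distance."""
    rng = np.random.default_rng(seed)
    trees = [RPTree(data, leaf_size=leaf_size, rng=rng) for _ in range(n_trees)]
    candidates = np.unique(np.concatenate([t.leaf_of(query) for t in trees]))
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    order = np.argsort(dists)[:k]
    return candidates[order], dists[order]


if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((1000, 16))
    neighbors, distances = rp_forest_knn(X, X[0], k=5)
    print(neighbors, distances)
```

The design reflects the intuition behind the exponential-decay result: if a single tree separates a query from one of its true neighbors with probability at most q < 1, and the trees are grown independently, then the chance that all T trees miss that neighbor is at most q^T, so the missing rate drops quickly as n_trees grows, while the exact re-ranking over the candidate union keeps the reported kNN distances accurate.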
