The k-Means Forest Classifier for High Dimensional Data

To the best of our knowledge, the priority search k-means tree algorithm is the most effective k-nearest neighbor algorithm for high-dimensional data. However, it is sensitive to attribute noise, which is common in high-dimensional spaces. This paper therefore presents a new method, k-means forest, that combines the priority search k-means tree algorithm with random forest. The main idea is to build multiple priority search k-means trees, each making decisions on a fixed number of randomly selected attributes, and to obtain the final result by voting. We also design a parallel version of the algorithm. Experimental results on artificial and public benchmark data sets demonstrate the effectiveness of the proposed method.
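
The attribute-sampling and voting idea can be illustrated with a minimal sketch. It is not the paper's implementation: a brute-force k-nearest-neighbor search stands in for each priority search k-means tree, and all names (`forest_predict`, `knn_vote`, `n_trees`, `n_attrs`) and the toy data are hypothetical choices made for illustration only.

```python
import random
from collections import Counter


def squared_distance(a, b, attrs):
    """Squared Euclidean distance restricted to the attribute indices in attrs."""
    return sum((a[i] - b[i]) ** 2 for i in attrs)


def knn_vote(train_X, train_y, query, attrs, k):
    """Majority label among the k nearest training points, using only attrs.

    Assumption: this brute-force search replaces the priority search
    k-means tree that the paper builds for each forest member.
    """
    order = sorted(
        range(len(train_X)),
        key=lambda j: squared_distance(train_X[j], query, attrs),
    )
    labels = [train_y[j] for j in order[:k]]
    return Counter(labels).most_common(1)[0][0]


def forest_predict(train_X, train_y, query, n_trees=15, n_attrs=2, k=3, seed=0):
    """Combine the votes of n_trees members, each seeing n_attrs random attributes."""
    rng = random.Random(seed)
    dim = len(train_X[0])
    votes = Counter()
    for _ in range(n_trees):
        # Each member draws its own fixed-size attribute subset.
        attrs = rng.sample(range(dim), n_attrs)
        votes[knn_vote(train_X, train_y, query, attrs, k)] += 1
    # The final result is the label with the most member votes.
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated classes in four dimensions.
TRAIN_X = [
    [0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1],
    [10, 10, 10, 10], [9, 10, 9, 10], [10, 9, 10, 9],
]
TRAIN_Y = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print(forest_predict(TRAIN_X, TRAIN_Y, [1, 1, 1, 1]))
    print(forest_predict(TRAIN_X, TRAIN_Y, [9, 9, 9, 9]))
```

Because each member's vote in this sketch is computed independently of the others, the loop over members could in principle be distributed across workers; how the paper's parallel version is actually organized is not stated in the abstract.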
