On speeding up the implementation of nearest neighbour search and classification

This paper presents practical techniques for speeding up implementations of nearest neighbour search and classification for high-dimensional data and/or large numbers of training examples, settings that frequently arise in big data and data mining. We apply a fast iterative form of the polar decomposition and use the computed matrix to pre-select a smaller set of candidate classes for each query element. We show that a further speed-up is possible when the training classes contain many instances: the classes are subdivided into subclasses by a fast approximation of a clustering algorithm, and the resulting partition is used to build the decomposition matrix. Our pre-processing step (linear or near-linear in the number of examples and dimensions) and pre-selection step (dependent on the number of classes) can be combined with any well-known indexing method, such as the annulus method, k-d trees, metric trees, R-trees, or cover trees, to limit the set of training instances examined during search or classification. Finally, we introduce what we call a cluster index and show that in practice it extends the applicability of indexing structures with higher-order complexity to larger datasets.
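The abstract does not spell out the iteration it uses, so the sketch below is only a plausible reading: an inverse-free Newton-Schulz scheme in the spirit of Björck's iterative algorithm for the orthogonal polar factor, followed by a purely hypothetical pre-selection step in which class centroids are scored against the query and only the top-k classes are kept as candidates. The function name, the centroid matrix, and the top-k scoring are illustrative assumptions, not the authors' method.

```python
import numpy as np

def orthogonal_polar_factor(A, n_iter=20):
    """Approximate the orthogonal factor Q of the polar decomposition
    A = Q H via the inverse-free Newton-Schulz iteration
        X_{k+1} = 0.5 * X_k @ (3 I - X_k^T X_k).
    Convergence requires the singular values of the start matrix to lie
    in (0, sqrt(3)), so A is first scaled by its Frobenius norm."""
    X = A / np.linalg.norm(A)      # Frobenius scaling: singular values <= 1
    I = np.eye(A.shape[1])
    for _ in range(n_iter):
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X

# Hypothetical pre-selection sketch: rank classes by projected score and
# keep only the top-k as candidates for the exact nearest-neighbour step.
centroids = np.random.rand(10, 64)     # toy data: one row per class
query = np.random.rand(64)
Q = orthogonal_polar_factor(centroids)
scores = Q @ query                     # one score per class
candidates = np.argsort(scores)[-3:]   # e.g. keep the 3 best-scoring classes
```

Because the iteration uses only matrix multiplications, it is cheap on modern hardware and its cost grows linearly with the number of classes, which is consistent with the complexity claims made for the pre-processing and pre-selection steps above.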
