kNNVWC: An Efficient k-Nearest Neighbors Approach Based on Various-Widths Clustering

The $k$-nearest neighbor ($k$-NN) approach has been extensively used as a powerful non-parametric technique in many scientific and engineering applications. However, it incurs a large computational cost, and reducing this cost has become an active research area. This work presents $k$NNVWC, a novel $k$-NN approach based on various-widths clustering that efficiently finds the $k$ nearest neighbors of a query object in a given data set. $k$NNVWC clusters the data using various widths: the data set is first clustered with a global width, and each resulting cluster that meets predefined criteria is then recursively clustered with its own local width, chosen to suit its distribution. This reduces the clustering time and balances both the number of clusters produced and their sizes. The search is made more efficient by using the triangle inequality to prune unlikely clusters. Experimental results demonstrate that $k$NNVWC performs well in finding $k$-NNs for query objects compared with a number of $k$-NN search algorithms, especially for large, high-dimensional data sets with varied distributions.
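The two ideas above, recursive clustering with per-cluster widths and pruning by the triangle inequality, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the split criterion (`max_size`), and the choice of half the cluster radius as the local width are all assumptions made for illustration, since the abstract does not specify them.

```python
import heapq
import math


def fixed_width_clusters(points, width):
    """Put each point in the first cluster whose center is within `width`,
    otherwise start a new cluster centered on that point."""
    clusters = []
    for p in points:
        for c in clusters:
            d = math.dist(p, c["center"])
            if d <= width:
                c["members"].append(p)
                c["radius"] = max(c["radius"], d)
                break
        else:
            clusters.append({"center": p, "members": [p], "radius": 0.0})
    return clusters


def various_width_clusters(points, width, max_size, depth=0, max_depth=20):
    """Cluster with a global width, then re-cluster oversized clusters.

    Assumptions (not from the paper): a cluster is split when it holds more
    than `max_size` points, and its local width is half its own radius.
    """
    leaves = []
    for c in fixed_width_clusters(points, width):
        too_big = len(c["members"]) > max_size
        if too_big and c["radius"] > 0 and depth < max_depth:
            leaves.extend(various_width_clusters(
                c["members"], c["radius"] / 2, max_size, depth + 1, max_depth))
        else:
            leaves.append(c)
    return leaves


def knn_pruned(query, clusters, k):
    """Exact k-NN over the clusters, skipping clusters that cannot help.

    Triangle inequality: for every member x of a cluster,
    dist(query, x) >= dist(query, center) - radius.
    Returns (sorted list of (distance, point), number of clusters scanned).
    """
    order = sorted((math.dist(query, c["center"]), i)
                   for i, c in enumerate(clusters))
    best = []  # max-heap of the k best so far, stored as (-distance, point)
    scanned = 0
    for d_center, i in order:
        c = clusters[i]
        lower_bound = d_center - c["radius"]
        if len(best) == k and lower_bound >= -best[0][0]:
            continue  # no member can beat the current k-th nearest distance
        scanned += 1
        for x in c["members"]:
            d = math.dist(query, x)
            if len(best) < k:
                heapq.heappush(best, (-d, x))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, x))
    return sorted((-nd, x) for nd, x in best), scanned
```

Points are assumed to be tuples of floats, and `math.dist` requires Python 3.8 or later. Because a cluster is skipped only when its lower bound cannot beat the current $k$-th nearest distance, the returned distances match a brute-force scan. How many clusters are actually skipped depends on the data and the widths chosen.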
