Clustering-based k-nearest neighbor classification for large-scale data with neural codes representation

Abstract While standing as one of the most widely considered and successful supervised classification algorithms, the k-nearest Neighbor (kNN) classifier generally depicts a poor efficiency due to being an instance-based method. In this sense, Approximated Similarity Search (ASS) stands as a possible alternative to improve those efficiency issues at the expense of typically lowering the performance of the classifier. In this paper we take as initial point an ASS strategy based on clustering. We then improve its performance by solving issues related to instances located close to the cluster boundaries by enlarging their size and considering the use of Deep Neural Networks for learning a suitable representation for the classification task at issue. Results using a collection of eight different datasets show that the combined use of these two strategies entails a significant improvement in the accuracy performance, with a considerable reduction in the number of distances needed to classify a sample in comparison to the basic kNN rule.

[1]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[2]  Sergios Theodoridis,et al.  Pattern Recognition , 1998, IEEE Trans. Neural Networks.

[3]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[4]  Chris Mellish,et al.  On the Consistency of Information Filters for Lazy Learning Algorithms , 1999, PKDD.

[5]  Antonio Torralba,et al.  Spectral Hashing , 2008, NIPS.

[6]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[7]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[8]  Georgios Evangelidis,et al.  RHC: a non-parametric cluster-based data reduction for efficient k\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\ , 2014, Pattern Analysis and Applications.

[9]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[10]  Juan Ramón Rico-Juan,et al.  Improving kNN multi-label classification in Prototype Selection scenarios using class proposals , 2015, Pattern Recognit..

[11]  Loris Nanni,et al.  Prototype reduction techniques: A comparison among different approaches , 2011, Expert Syst. Appl..

[12]  Mayank Bawa,et al.  LSH forest: self-tuning indexes for similarity search , 2005, WWW '05.

[13]  Heng Tao Shen,et al.  Hashing for Similarity Search: A Survey , 2014, ArXiv.

[14]  Francisco Herrera,et al.  kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors classifier for big data , 2017, Knowl. Based Syst..

[15]  Fabrizio Angiulli,et al.  Fast Nearest Neighbor Condensation for Large Data Sets Classification , 2007, IEEE Transactions on Knowledge and Data Engineering.

[16]  Kaiqi Huang,et al.  Learning Convolutional Nonlinear Features for K Nearest Neighbor Image Classification , 2014, 2014 22nd International Conference on Pattern Recognition.

[17]  Josef Kittler,et al.  Pattern recognition : a statistical approach , 1982 .

[18]  Yoshua Bengio,et al.  Supplementary material for : How transferable are features in deep neural networks ? , 2014 .

[19]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[20]  E. Ruiz An algorithm for finding nearest neighbours in (approximately) constant average time , 1986 .

[21]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[22]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[23]  Alexander G. Gray,et al.  Efficient exact k-NN and nonparametric classification in high dimensions , 2003, NIPS 2003.

[24]  Lior Rokach,et al.  A survey of Clustering Algorithms , 2010, Data Mining and Knowledge Discovery Handbook.

[25]  José Oncina,et al.  Recognition of Pen-Based Music Notation: The HOMUS Dataset , 2014, 2014 22nd International Conference on Pattern Recognition.

[26]  Cordelia Schmid,et al.  Product Quantization for Nearest Neighbor Search , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Jon Louis Bentley,et al.  An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1977, TOMS.

[28]  David G. Lowe,et al.  Scalable Nearest Neighbor Algorithms for High Dimensional Data , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Qingsheng Zhu,et al.  Adaptive edited natural neighbor algorithm , 2017, Neurocomputing.

[30]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[31]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[32]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[33]  Juan Ramón Rico-Juan,et al.  New rank methods for reducing the size of the training set using the nearest neighbor rule , 2012, Pattern Recognit. Lett..

[34]  Francisco Herrera,et al.  Data Preprocessing in Data Mining , 2014, Intelligent Systems Reference Library.

[35]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[36]  Xingquan Zhu,et al.  Class Noise vs. Attribute Noise: A Quantitative Study , 2003, Artificial Intelligence Review.

[37]  Julia Hirschberg,et al.  V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure , 2007, EMNLP.

[38]  William Eberle,et al.  Genetic algorithms in feature and instance selection , 2013, Knowl. Based Syst..

[39]  Patrick J. Grother,et al.  The First Census Optical Character Recognition Systems Conference | NIST , 1992 .

[40]  Francisco Herrera,et al.  Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[42]  Chih-Jen Lin,et al.  A comparison of methods for multiclass support vector machines , 2002, IEEE Trans. Neural Networks.

[43]  Atsuto Maki,et al.  Factors of Transferability for a Generic ConvNet Representation , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Peter E. Hart,et al.  The condensed nearest neighbor rule (Corresp.) , 1968, IEEE Trans. Inf. Theory.

[45]  Francisco Herrera,et al.  Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection , 2012, Inf. Sci..

[46]  Xiaofeng Zhu,et al.  Efficient kNN Classification With Different Numbers of Nearest Neighbors , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[47]  Yann LeCun,et al.  Large-scale Learning with SVM and Convolutional for Generic Object Categorization , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[48]  Tony R. Martinez,et al.  Instance Pruning Techniques , 1997, ICML.

[49]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[50]  T. Caliński,et al.  A dendrite method for cluster analysis , 1974 .

[51]  Yichuan Tang,et al.  Deep Learning using Linear Support Vector Machines , 2013, 1306.0239.

[52]  Belur V. Dasarathy,et al.  Nearest Neighbour Editing and Condensing Tools–Synergy Exploitation , 2000, Pattern Analysis & Applications.

[53]  Xindong Wu,et al.  Data mining with big data , 2014, IEEE Transactions on Knowledge and Data Engineering.

[54]  Nicolás García-Pedrajas,et al.  Boosting instance selection algorithms , 2014, Knowl. Based Syst..

[55]  Inderjit S. Dhillon,et al.  Online Metric Learning and Fast Similarity Search , 2008, NIPS.

[57]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[58]  Victor S. Lempitsky,et al.  Neural Codes for Image Retrieval , 2014, ECCV.

[59]  G. Priya,et al.  EFFICIENT KNN CLASSIFICATION ALGORITHM FOR BIG DATA , 2017 .

[60]  Sergio Ramírez-Gallego,et al.  Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark , 2017, IEEE Transactions on Systems, Man, and Cybernetics: Systems.

[61]  Jonathan J. Hull,et al.  A Database for Handwritten Text Recognition Research , 1994, IEEE Trans. Pattern Anal. Mach. Intell..

[62]  Francisco Herrera,et al.  On the combination of evolutionary algorithms and stratified strategies for training set selection in data mining , 2006, Appl. Soft Comput..

[63]  Hadi Sadoghi Yazdi,et al.  IRAHC: Instance Reduction Algorithm using Hyperrectangle Clustering , 2015, Pattern Recognit..