Instance selection of linear complexity for big data

Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large to massive data sets. In this paper, two new algorithms of linear complexity for instance selection are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n^2), or log-linear, O(n log n)) prevents them from processing large data sets, the new proposal achieves competitive accuracy. Even more remarkably, it shortens execution time, since its complexity is linear with respect to the data set size. The new proposal has been compared with some of the best-known instance selection methods and has also been evaluated on large data sets (up to a million instances).
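To illustrate why locality-sensitive hashing makes linear-time instance selection possible, the sketch below shows one way such a scheme might look. It is not the authors' algorithms: the function name, the p-stable-style random projections, the bucket width, and the keep-one-per-bucket-and-class rule are all illustrative assumptions. The key point it demonstrates is that each instance is hashed and inspected exactly once, so the whole selection is a single O(n) pass.

```python
# Minimal sketch of LSH-based instance selection (illustrative only, NOT the
# paper's exact algorithms): hash every instance with random projections and
# keep one representative per (bucket, class) pair in a single linear pass.
import numpy as np

def lsh_instance_selection(X, y, n_projections=8, bucket_width=1.0, seed=0):
    """Return indices of selected instances.

    X : (n_samples, n_features) float array
    y : (n_samples,) array of class labels
    n_projections, bucket_width : assumed LSH parameters for this sketch
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape

    # Random projection matrix and offsets, in the style of p-stable LSH.
    W = rng.normal(size=(n_features, n_projections))
    b = rng.uniform(0.0, bucket_width, size=n_projections)

    # Each instance gets a tuple of integer hash keys; nearby instances
    # tend to fall into the same bucket.
    keys = np.floor((X @ W + b) / bucket_width).astype(np.int64)

    selected = []
    seen = set()
    for i in range(n_samples):            # single O(n) pass over the data
        bucket = (tuple(keys[i]), y[i])   # bucket id plus class label
        if bucket not in seen:            # keep the first instance found
            seen.add(bucket)              # for each bucket-and-class pair
            selected.append(i)
    return np.array(selected)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(10_000, 20))
    y = rng.integers(0, 2, size=10_000)
    idx = lsh_instance_selection(X, y)
    print(f"kept {len(idx)} of {len(X)} instances")
```

In a sketch like this, the number of projections and the bucket width control how aggressively the data set is reduced: coarser buckets merge more instances and keep fewer representatives.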
