Combining nearest neighbour classifiers based on small subsamples for big data analytics

Contemporary machine learning systems must cope with ever-growing volumes of data, yet most canonical classifiers are not well-suited for big data analytics. This is especially evident in the case of distance-based classifiers, whose classification time is prohibitive. Recently, many methods for adapting the nearest neighbour classifier to big data have been proposed. We investigate a simple, yet efficient technique based on random under-sampling of the dataset. As we deal with stationary data, one may assume that a subset of objects will sufficiently capture the properties of the given dataset. We propose to build distance-based classifiers on very small subsamples and then combine them into an ensemble. With this approach, one does not need to aggregate the datasets, only the local decisions of the classifiers. Experimental results show that such an approach can return results comparable to a nearest neighbour classifier trained over the entire dataset, but with significantly reduced classification time. We investigate the number of subsamples (ensemble members) required to capture the properties of each dataset. Finally, we propose to apply our subsampling-based ensemble in a distributed environment, which allows a further reduction of the computational complexity of the nearest neighbour rule for big data.
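The core idea above — training distance-based classifiers on very small random subsamples and combining only their local decisions — can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes 1-NN ensemble members and simple majority voting, and all function names are illustrative.

```python
import numpy as np

def build_subsample_ensemble(X, y, n_members=10, sample_size=50, rng=None):
    """Draw n_members random subsamples without replacement; each member
    keeps only its own small reference set, so no dataset aggregation is needed."""
    rng = np.random.default_rng(rng)
    members = []
    for _ in range(n_members):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        members.append((X[idx], y[idx]))
    return members

def predict(members, x):
    """Each member casts a 1-NN vote on x; the ensemble returns the majority label.
    Classification cost scales with n_members * sample_size, not with the full dataset."""
    votes = []
    for X_sub, y_sub in members:
        nearest = np.argmin(np.linalg.norm(X_sub - x, axis=1))
        votes.append(y_sub[nearest])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```

Because each member is independent, the loop in `predict` is trivially parallelisable across nodes in a distributed environment, which matches the paper's closing point about further reducing classification cost.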
