Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark

Mining massive, high-speed data streams is among the main contemporary challenges in machine learning. It calls for methods that are computationally efficient, can continuously update their structure, and can handle an ever-growing number of arriving instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to this demanding scenario. The method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously updates the case-base and removes outdated examples from it. This alleviates the high computational requirements of the original classifier, making it suitable for the considered problem. An experimental study conducted on a set of real-life massive data streams demonstrates the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed big and streaming data.
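
To make the idea of an incrementally updated case-base concrete, the following is a minimal single-machine sketch in Python. It is not the method proposed in the paper. It assumes a simple sliding window as the rule for discarding outdated examples and a brute-force scan in place of the distributed metric-space ordering, and it does not use Spark. The class name `SlidingWindowKNN` and its parameters `k` and `max_size` are hypothetical, introduced only for illustration.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Toy incremental k-NN (hypothetical illustration, not the paper's algorithm).

    The case-base is bounded: once it holds max_size examples,
    adding a new one silently discards the oldest.
    """

    def __init__(self, k=3, max_size=1000):
        self.k = k
        # A deque with maxlen evicts the oldest entry automatically when full.
        self.case_base = deque(maxlen=max_size)

    def update(self, batch):
        """Add a batch of (features, label) pairs from the stream."""
        for features, label in batch:
            self.case_base.append((tuple(features), label))

    def predict(self, query):
        """Majority vote among the k nearest stored examples (brute-force scan)."""
        if not self.case_base:
            raise ValueError("case-base is empty")
        query = tuple(query)
        # Sort all stored examples by Euclidean distance and keep the k closest.
        nearest = sorted(
            self.case_base, key=lambda example: math.dist(example[0], query)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


if __name__ == "__main__":
    clf = SlidingWindowKNN(k=1, max_size=2)
    clf.update([((0.0, 0.0), "a"), ((10.0, 10.0), "b")])
    print(clf.predict((0.5, 0.5)))
    # A new arrival evicts the oldest stored example.
    clf.update([((0.2, 0.2), "b")])
    print(clf.predict((0.5, 0.5)))
```

The brute-force scan costs time linear in the case-base size per query. The approach described in the abstract targets this cost with a metric-space ordering for faster searches and with instance selection to keep the case-base small.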
