Paper information - Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark

Mining massive and high-speed data streams is among the main contemporary challenges in machine learning. This calls for methods that are computationally efficient, can continuously update their structure, and can handle an ever-growing number of arriving instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to this demanding scenario. The method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously updates the case-base and removes outdated examples from it. This alleviates the high computational requirements of the original classifier, making it suitable for the considered problem. An experimental study conducted on a set of real-life massive data streams demonstrates the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed big and streaming data.
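To make the idea of an incrementally updated case-base concrete, the following is a minimal single-machine sketch in Python. It is not the method proposed in the paper: it assumes a plain first-in-first-out window in place of the paper's instance selection, uses a brute-force Euclidean search instead of the distributed metric-space ordering, and does not use Spark at all. The class name `SlidingWindowKNN` and its parameters `k` and `capacity` are hypothetical and chosen only for illustration.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Toy incremental k-NN that keeps only the most recent labelled examples.

    Assumption: outdated examples are simply the oldest ones (FIFO window).
    """

    def __init__(self, k=3, capacity=1000):
        self.k = k
        # A deque with maxlen silently drops the oldest item when it is full.
        self.case_base = deque(maxlen=capacity)

    def learn_one(self, x, y):
        """Add one labelled instance to the case-base."""
        self.case_base.append((tuple(x), y))

    def predict_one(self, x):
        """Return the majority label among the k nearest stored examples."""
        if not self.case_base:
            return None
        # Brute-force search: sort every stored case by Euclidean distance.
        nearest = sorted(
            self.case_base, key=lambda case: math.dist(case[0], x)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Test-then-train loop over a tiny hypothetical stream.
clf = SlidingWindowKNN(k=1, capacity=2)
stream = [([0.0, 0.0], "a"), ([1.0, 1.0], "b"), ([5.0, 5.0], "c")]
for features, label in stream:
    prediction = clf.predict_one(features)  # predict before seeing the label
    clf.learn_one(features, label)          # then update the case-base
```

With `capacity=2`, the third instance evicts the first, so the case-base never grows beyond a fixed size. That bound is what keeps the per-instance cost constant in this sketch; the paper's approach additionally speeds up the search itself and distributes the work across a cluster.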
