Towards a Parallel Computationally Efficient Approach to Scaling Up Data Stream Classification

Advances in hardware technologies make it possible to capture and process data in real time, and the resulting high-throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) develops data mining algorithms that allow us to analyse these continuous streams of data in real time. The creation and real-time adaptation of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even though these algorithms are fast, they are challenged by high-velocity data streams, in which data instances arrive at a fast rate. This is problematic if the application requires little or no delay between a change in the patterns of the stream and the absorption of that change by the classifier. Problems of scaling traditional data mining algorithms for static (non-streaming) datasets to Big Data have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
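To make the idea of a real-time adaptive KNN classifier concrete, the following is a minimal sketch and not the methodology of the paper. It assumes that adaptation is achieved by keeping only the most recent labelled instances in a fixed-size sliding window; the class name `SlidingWindowKNN` and the parameters `k` and `window_size` are hypothetical and chosen for illustration only.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Illustrative KNN over the most recent labelled instances (assumed design)."""

    def __init__(self, k=3, window_size=100):
        self.k = k
        # A deque with maxlen discards the oldest instance when a new one arrives,
        # so the training set always reflects the latest part of the stream.
        self.window = deque(maxlen=window_size)

    def learn(self, x, label):
        """Absorb one labelled instance from the stream."""
        self.window.append((tuple(x), label))

    def predict(self, x):
        """Return the majority label among the k nearest instances in the window."""
        if not self.window:
            return None
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


clf = SlidingWindowKNN(k=3, window_size=4)
for point in [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]:
    clf.learn(point, "a")
before_drift = clf.predict((0.05, 0.05))

# Simulated concept drift: the same region of the feature space is now labelled "b".
for point in [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]:
    clf.learn(point, "b")
after_drift = clf.predict((0.05, 0.05))
```

In this sketch the distance of the query to each stored instance is computed independently, which is the property that makes the distance step a natural candidate for partitioning across workers; how the work is actually distributed is not specified here.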
