Comprehensive analysis for class imbalance data with concept drift using ensemble based classification

In many information system applications, the environment is dynamic and a tremendous amount of streaming data is generated. This places additional computational demands on the learning algorithm, which must process incoming instances incrementally under tighter memory and time constraints than in static data mining. Moreover, when data streams are collected from different sources, they may exhibit concept drift, that is, a change in the distribution of the data over time, and they can have a high degree of class imbalance. Class imbalance occurs when far fewer examples represent one class than the other. Concept drift and imbalanced streaming data are common in real-world applications such as fraud detection, intrusion detection, decision support systems and disease prediction. In this paper, different concept drift detectors and handling approaches are analysed in the presence of imbalanced data. A comparative analysis is performed on the SEA synthetic data stream and on real-world data sets. The Massive Online Analysis (MOA) tool is used to compare different learners in a concept-drifting environment. Accuracy, precision, recall, F1-score and the Kappa statistic are used to evaluate the performance of the learners. Ensemble classifiers and single learners are tested on samples of the SEA synthetic data stream, the electrical data set and the KDD intrusion data set. The ensemble classifiers provide better accuracy than the single classifiers, and ensemble-based methods perform well compared with strong single learners when dealing with concept drift and class-imbalanced data.
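The evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the paper's MOA setup; the function name `stream_metrics`, the binary-label assumption, and the toy data are hypothetical and chosen only to show why accuracy alone can mislead on imbalanced data, while recall, F1-score and the Kappa statistic expose a classifier that ignores the minority class.

```python
def stream_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1 and Cohen's kappa
    for binary labels (hypothetical helper, not part of MOA)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1

    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Agreement expected by chance, from the marginal class frequencies.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Toy imbalanced sample: 95 majority (0) and 5 minority (1) instances,
# scored against a classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(stream_metrics(y_true, y_pred))
```

On this toy sample the always-majority classifier reaches high accuracy but zero recall, F1-score and Kappa, which is why the abstract's additional measures matter when classes are imbalanced.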
