A similarity-based approach for data stream classification

Incremental learning techniques have been used extensively to address the data stream classification problem. The central issue is balancing accuracy against efficiency: the algorithm should classify well while responding within a reasonable time. This work introduces a new technique, the Similarity-based Data Stream Classifier (SimC), which achieves good performance through a novel insertion/removal policy that adapts quickly to trends in the data and maintains a small, representative set of examples and estimators that supports good classification rates. The method can also detect novel classes (labels) during the running phase and remove those that no longer add value to the classification process. Statistical tests were used to evaluate the model from two points of view: efficacy (classification rate) and efficiency (online response time). Five well-known techniques were compared over sixteen data streams using the Friedman test. To determine which schemes differed significantly, the Nemenyi, Holm, and Shaffer tests were also applied. The results show that SimC is very competitive in terms of accuracy (both absolute and streaming) and of classification and updating time, in comparison with several of the most popular methods in the literature.
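
As an illustration of the statistical comparison mentioned above, the following is a minimal sketch of the Friedman statistic and the Nemenyi critical difference in their standard textbook form. It is not the authors' evaluation code. The function names (`average_ranks`, `friedman_statistic`, `nemenyi_critical_difference`) and the layout of the score table are assumptions made for this example, and the critical value `q_alpha` must be supplied by the caller from a published table.

```python
from math import sqrt
from typing import List


def average_ranks(scores: List[List[float]]) -> List[float]:
    """Average rank of each classifier over all data streams.

    scores[i][j] is the accuracy of classifier j on stream i (higher is
    better). Rank 1 is the best; tied scores share the mean of their ranks.
    """
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])  # best first
        pos = 0
        while pos < k:
            end = pos
            # Extend the group while the next score ties with the current one.
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistic(scores: List[List[float]]) -> float:
    """Friedman chi-square: 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4)."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_critical_difference(k: int, n: int, q_alpha: float) -> float:
    """Two classifiers differ significantly if their average ranks differ by
    at least q_alpha * sqrt(k(k+1) / (6N))."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


if __name__ == "__main__":
    # Made-up accuracies for 3 hypothetical classifiers on 4 streams,
    # used only to exercise the functions; these are not results from the paper.
    demo = [
        [0.90, 0.80, 0.70],
        [0.85, 0.80, 0.75],
        [0.70, 0.90, 0.60],
        [0.95, 0.90, 0.85],
    ]
    print(average_ranks(demo))
    print(friedman_statistic(demo))
    print(nemenyi_critical_difference(k=3, n=4, q_alpha=2.343))
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; only when that null hypothesis is rejected does a post hoc test such as Nemenyi's come into play. Holm's and Shaffer's procedures instead adjust the p-values of the pairwise comparisons and are not covered by this sketch.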
