Heterogeneous Ensemble for Feature Drifts in Data Streams

The nature of data streams requires classification algorithms to be real-time, efficient, and able to cope with high-dimensional data that arrive continuously. It is well known that in high-dimensional datasets, not all features are critical for training a classifier. To improve the performance of data stream classification, we propose an algorithm called HEFT-Stream (Heterogeneous Ensemble with Feature drifT for data Streams) that incorporates feature selection into a heterogeneous ensemble to adapt to different types of concept drift. As an instance of the proposed framework, we first modify the FCBF algorithm [5] so that it dynamically updates the relevant feature subsets for data streams. Next, a heterogeneous ensemble is constructed from different online classifiers, including Online Naive Bayes and CVFDT [3]. Empirical results show that our ensemble classifier outperforms state-of-the-art ensemble classifiers (AWE [1] and Online Bagging [15]) in terms of accuracy, speed, and scalability. The success of HEFT-Stream opens new research directions in understanding the relationship between feature selection techniques and ensemble learning for better classification performance.
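
To make the feature-selection step concrete, here is a minimal Python sketch of one FCBF pass over a window of discrete-valued instances, using symmetrical uncertainty SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) as in Yu and Liu [5]. The function names, the window representation, and the default threshold delta are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); 0 = independent, 1 = identical."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mi = hx + hy - entropy(list(zip(x, y)))  # I(X; Y) = H(X) + H(Y) - H(X, Y)
    return 2.0 * mi / (hx + hy)

def fcbf(window, labels, delta=0.05):
    """One FCBF pass over a window of discrete instances.

    window: list of instances, each a sequence of discrete feature values.
    labels: one class label per instance.
    Returns the indices of the selected (predominant) features.
    """
    cols = list(map(list, zip(*window)))  # per-feature column view
    # Step 1: keep features whose SU with the class exceeds delta,
    # ranked from most to least relevant.
    ranked = sorted(
        ((symmetrical_uncertainty(c, labels), j) for j, c in enumerate(cols)),
        reverse=True,
    )
    ranked = [(s, j) for s, j in ranked if s > delta]
    # Step 2: greedily keep the best remaining feature and drop any
    # feature F_j it covers, i.e. SU(F_i, F_j) >= SU(F_j, C).
    selected = []
    while ranked:
        s_i, i = ranked.pop(0)
        selected.append(i)
        ranked = [(s_j, j) for s_j, j in ranked
                  if symmetrical_uncertainty(cols[i], cols[j]) < s_j]
    return selected
```

Re-running this pass on successive windows and comparing the returned subset with the previous one yields a simple feature-drift signal, which the next sketch builds on.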

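Building on that helper, the following sketch shows one way the ensemble loop described in the abstract could be wired together: a sliding window feeds periodic FCBF re-selection, a change in the selected subset is treated as a feature drift, and the heterogeneous members vote on the projected instance. The class name, the learn/predict member interface, the factory `make_members`, and the retrain-from-scratch reaction to drift are assumptions for illustration; the paper's actual update and weighting rules may differ.

```python
from collections import Counter, deque

class HEFTStreamSketch:
    """Sliding-window loop: re-select features with fcbf() (from the sketch
    above) and treat a change in the selected subset as a feature drift
    that triggers rebuilding the heterogeneous ensemble members."""

    def __init__(self, make_members, window_size=500, delta=0.05):
        # make_members is a hypothetical factory returning fresh incremental
        # classifiers, each exposing learn(x, y) and predict(x); in the paper
        # these would include an online naive Bayes and a CVFDT-style tree.
        self.make_members = make_members
        self.window = deque(maxlen=window_size)
        self.delta = delta
        self.seen = 0
        self.features = []            # indices of the current feature subset
        self.members = make_members()

    def _project(self, x):
        return tuple(x[j] for j in self.features)

    def learn_one(self, x, y):
        self.window.append((x, y))
        self.seen += 1
        if self.seen % self.window.maxlen == 0:     # periodic re-selection
            xs, ys = zip(*self.window)
            new = fcbf(list(xs), list(ys), self.delta)
            if set(new) != set(self.features):      # feature drift detected
                self.features = new
                self.members = self.make_members()  # rebuild on the new subset
        if self.features:
            for m in self.members:
                m.learn(self._project(x), y)

    def predict_one(self, x):
        if not self.features:
            return None  # still warming up
        # Plain majority vote; the full algorithm could instead weight
        # members by their recent accuracy.
        votes = Counter(m.predict(self._project(x)) for m in self.members)
        return votes.most_common(1)[0][0]
```

Rebuilding every member on drift is the bluntest possible reaction; gentler options, such as resetting only the worst-performing member or keeping members and only re-projecting inputs, fit the same skeleton.
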
References

[1] Philip S. Yu, et al. Mining concept-drifting data streams using ensemble classifiers. KDD '03, 2003.

[2] J. Friedman. Stochastic gradient boosting. 2002.

[3] Geoff Hulten, et al. Mining time-changing data streams. KDD '01, 2001.

[4] Jaideep Srivastava, et al. Diversity in Combinations of Heterogeneous Classifiers. PAKDD, 2009.

[5] Huan Liu, et al. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. ICML, 2003.

[6] Kevin W. Bowyer, et al. Combination of Multiple Classifiers Using Local Accuracy Estimates. IEEE Trans. Pattern Anal. Mach. Intell., 1997.

[7] A. Sharkey. Linear and Order Statistics Combiners for Pattern Classification. 1999.

[8] Chunhua Shen, et al. On the Dual Formulation of Boosting Algorithms. IEEE Trans. Pattern Anal. Mach. Intell., 2009.

[9] Geoff Holmes, et al. New ensemble methods for evolving data streams. KDD, 2009.

[10] Tin Kam Ho. The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Anal. Mach. Intell., 1998.

[11] Pedro M. Domingos, et al. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning, 1997.

[12] Geoff Hulten, et al. Mining high-speed data streams. KDD '00, 2000.

[13] William Nick Street, et al. A streaming ensemble algorithm (SEA) for large-scale classification. KDD '01, 2001.

[14] Geoff Holmes, et al. MOA: Massive Online Analysis. J. Mach. Learn. Res., 2010.

[15] Stuart J. Russell, et al. Online bagging and boosting. IEEE International Conference on Systems, Man and Cybernetics, 2005.

[16] Huan Liu, et al. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng., 2005.

[17] Yoav Freund, et al. Experiments with a New Boosting Algorithm. ICML, 1996.

[18] Xindong Wu, et al. Active Learning with Adaptive Heterogeneous Ensembles. Ninth IEEE International Conference on Data Mining, 2009.

[19] Fabio Roli, et al. A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell., 2005.

[20] Günther Eibl, et al. Multiclass Boosting for Weak Classifiers. J. Mach. Learn. Res., 2005.

[21] Leo Breiman. Bagging Predictors. Machine Learning, 1996.

[22] Leo Breiman. Random Forests. Machine Learning, 2001.

[23] Sattar Hashemi, et al. Adapted One-versus-All Decision Trees for Data Stream Classification. IEEE Trans. Knowl. Data Eng., 2009.