An Efficient Method of Building an Ensemble of Classifiers in Streaming Data

To refine a classifier efficiently over streaming data such as sensor data and web log data, we have to decide whether each unlabeled streaming datum should be selected for labeling. Existing methods refine a classifier at regular time intervals. They therefore refine a classifier even when its classification accuracy is high, and they keep using a classifier even when its classification accuracy is low. In this paper, we propose an ensemble method that selects, in an online process, the data that should be labeled. The selected data are used to build new classifiers for the ensemble. Our selection methodology uses the training data that were used to generate the ensemble of classifiers over the streaming data. We compared our ensemble approach with a conventional ensemble approach in which new classifiers are generated periodically. In experiments with ten benchmark data sets, including three real streaming data sets, our ensemble approach generated 12.9% of the number of new classifiers generated by the chunk-based ensemble approach using partially labeled samples, and used an average of 10% labeled samples across the ten data sets. In all the experiments, our ensemble approach produced comparable classification accuracy. These results show that our approach can efficiently maintain the performance of an ensemble over streaming data.
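The general idea of selecting streaming data online and building new ensemble members only from the selected data can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the selection criterion or the base learner, so everything below is an assumption made for illustration. The nearest-centroid base learner, the "coverage radius" derived from each member's training data, the names `CentroidClassifier`, `SelectiveEnsemble`, `needs_label`, and `oracle`, and the `batch_size` parameter are all hypothetical.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class CentroidClassifier:
    """Nearest-centroid learner (assumed stand-in for any base classifier)."""

    def __init__(self, samples):
        # samples: list of (feature_vector, label) pairs
        groups = {}
        for x, y in samples:
            groups.setdefault(y, []).append(x)
        self.centroids = {
            y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in groups.items()
        }
        # Assumed selection statistic: the largest distance from a training
        # sample to its own class centroid ("coverage radius").
        self.radius = max(dist(x, self.centroids[y]) for x, y in samples)

    def nearest(self, x):
        """Return (label of the nearest centroid, distance to it)."""
        label = min(self.centroids, key=lambda y: dist(x, self.centroids[y]))
        return label, dist(x, self.centroids[label])


class SelectiveEnsemble:
    def __init__(self, initial_samples, batch_size=4):
        self.members = [CentroidClassifier(initial_samples)]
        self.batch_size = batch_size  # selected data needed for a new member
        self.pending = []             # selected, labeled data not yet used
        self.labels_requested = 0

    def needs_label(self, x):
        # Hypothetical criterion: select x only if no member's training
        # data covers it.
        return all(m.nearest(x)[1] > m.radius for m in self.members)

    def predict(self, x):
        # Majority vote among covering members; all members vote otherwise.
        covering = [m for m in self.members if m.nearest(x)[1] <= m.radius]
        votes = Counter(m.nearest(x)[0] for m in (covering or self.members))
        return votes.most_common(1)[0][0]

    def process(self, x, oracle):
        """Handle one unlabeled datum; oracle(x) supplies a label on demand."""
        if self.needs_label(x):
            self.pending.append((x, oracle(x)))
            self.labels_requested += 1
            if len(self.pending) >= self.batch_size:
                # Build a new member from the selected data only.
                self.members.append(CentroidClassifier(self.pending))
                self.pending = []
        return self.predict(x)


# Toy stream: four data from the known concepts, then four from a new one.
def oracle(x):
    return "c" if x[0] > 8 else ("b" if x[1] > 2.5 else "a")

initial = [((0.0, 0.0), "a"), ((1.0, 0.0), "a"),
           ((5.0, 5.0), "b"), ((6.0, 5.0), "b")]
ens = SelectiveEnsemble(initial, batch_size=4)
stable = [(0.5, 0.2), (5.5, 4.8), (0.3, 0.0), (5.7, 5.0)]
drifted = [(10.0, 0.0), (11.0, 0.0), (10.0, 1.0), (11.0, 1.0)]
for x in stable + drifted:
    ens.process(x, oracle)
print(len(ens.members), ens.labels_requested)
```

In this toy run no labels are requested while the stream matches the initial training data, and a new classifier is built only once enough selected data from the new concept have been labeled. A periodic, chunk-based scheme would instead label data and build a classifier for every chunk regardless of need.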
