Improvements to Boosting with Data Streams

Data Streams (DS) pose a challenge for any machine learning algorithm because of their high volume: a typical data set contains on the order of millions of instances. Various algorithms have been proposed for this setting. In particular, OzaBoost, an online adaptation of AdaBoost, trains several “weak” learners in parallel and updates each of them with new instances as they arrive during training; at any moment, OzaBoost can be stopped to output the current model. However, OzaBoost suffers from high memory consumption, which prevents its use for certain types of problems. This work introduces OzaBoost Dynamic, which changes the weight calculation and the number of boosted “weak” learners used by OzaBoost to reduce its memory consumption. This work also presents empirical results showing the performance of all algorithms on data sets with 50 and 60 million instances.
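For context, the per-instance update that OzaBoost applies can be sketched as follows. This is a minimal illustration assuming scikit-learn-style base learners with partial_fit (GaussianNB here); the class name OzaBoostSketch and all parameters are hypothetical, and the sketch shows the baseline Poisson-based OzaBoost update, not OzaBoost Dynamic's modified weight calculation, whose details the abstract does not give.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


class OzaBoostSketch:
    """Illustrative online boosting in the style of OzaBoost; the base
    learner and all names here are assumptions, not the paper's code."""

    def __init__(self, n_models=10, classes=(0, 1), seed=0):
        self.models = [GaussianNB() for _ in range(n_models)]
        self.classes = np.array(classes)
        self.lam_sc = np.zeros(n_models)  # weight mass classified correctly
        self.lam_sw = np.zeros(n_models)  # weight mass classified wrongly
        self.fitted = [False] * n_models
        self.rng = np.random.default_rng(seed)

    def partial_fit(self, x, y):
        """Update every base model with one labelled instance from the stream."""
        x = np.asarray(x).reshape(1, -1)
        lam = 1.0  # each incoming instance starts with unit weight
        for m, model in enumerate(self.models):
            # Poisson sampling emulates weighted resampling on a stream:
            # the instance is shown to this model lam times on average.
            k = self.rng.poisson(lam)
            if not self.fitted[m]:
                k = max(k, 1)  # guarantee one update before predicting
            for _ in range(k):
                model.partial_fit(x, [y], classes=self.classes)
                self.fitted[m] = True
            if model.predict(x)[0] == y:
                self.lam_sc[m] += lam
                eps = self.lam_sw[m] / (self.lam_sc[m] + self.lam_sw[m])
                lam *= 1.0 / (2.0 * (1.0 - eps))  # shrink weight of easy instances
            else:
                self.lam_sw[m] += lam
                eps = self.lam_sw[m] / (self.lam_sc[m] + self.lam_sw[m])
                lam *= 1.0 / (2.0 * eps)  # grow weight of hard instances

    def predict(self, x):
        """Weighted-majority vote with log((1 - eps) / eps) per model."""
        x = np.asarray(x).reshape(1, -1)
        votes = {}
        for m, model in enumerate(self.models):
            total = self.lam_sc[m] + self.lam_sw[m]
            if not self.fitted[m] or total == 0.0:
                continue
            eps = self.lam_sw[m] / total
            if not 0.0 < eps < 0.5:
                continue  # ignore degenerate or worse-than-chance models
            w = np.log((1.0 - eps) / eps)
            label = model.predict(x)[0]
            votes[label] = votes.get(label, 0.0) + w
        return max(votes, key=votes.get) if votes else self.classes[0]
```

A learner of this shape is driven one instance at a time, e.g. calling partial_fit(x, y) inside the stream loop, and can be queried with predict at any point, matching the anytime behaviour described above.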
