Calculating feature importance in data streams with concept drift using Online Random Forest

Large-volume data streams with concept drift have attracted considerable attention in the machine learning community. Numerous researchers have proposed online learning algorithms that train iteratively on new observations and provide continuously relevant predictions. Compared with earlier offline or sliding-window approaches, these algorithms have shown better predictive performance, rapid detection of and adaptation to concept drift, and increased scalability to high-volume or high-velocity data. Online Random Forest (ORF) is one such approach to streaming classification problems. We adapted two feature importance metrics, Mean Decrease in Accuracy (MDA) and Mean Decrease in Gini Impurity (MDG), both originally designed for offline Random Forest, to Online Random Forest so that they evolve with time and concept drift. Our work is novel in that previous streaming models have not provided any measures of feature importance. We experimentally tested our Online Random Forest versions of feature importance against their offline counterparts in a simulated data stream, and concluded that our approach to tracking the underlying drifting concepts is valid.
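To make the MDG idea concrete, the following minimal Python sketch computes the Gini impurity decrease of a single split and accumulates it per feature. The exponential fading in `DecayedImportance` is only an illustrative assumption about how an importance score might be made to evolve with a stream; the class name, the `decay` parameter, and the update rule are hypothetical and are not taken from the method described above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


class DecayedImportance:
    """Hypothetical per-feature tracker (an assumption, not the paper's rule).

    Each new split fades all existing scores by `decay`, so contributions
    from older concepts gradually lose influence.
    """

    def __init__(self, n_features, decay=0.9):
        self.decay = decay
        self.scores = [0.0] * n_features

    def record_split(self, feature, decrease):
        # Fade every score, then credit the feature used by the new split.
        self.scores = [score * self.decay for score in self.scores]
        self.scores[feature] += decrease


# Usage: a perfect split of a balanced two-class node on feature 0.
tracker = DecayedImportance(n_features=2)
tracker.record_split(0, gini_decrease([0, 0, 1, 1], [0, 0], [1, 1]))
```

An offline MDG would instead sum these decreases over all splits of all trees without fading; the decay here simply illustrates one way a score could track a drifting concept.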
