Balanced random forest for imbalanced data streams

Data with highly imbalanced class distributions are common in real life. Machine learning application domains such as e-commerce, risk management, and environmental and health monitoring often suffer from class imbalance because the case of interest occurs rarely. A further layer of complexity is added when the data arrive as massive streams. In such a setting, it is often desirable to update the learning algorithm incrementally, both for scalability and for model adaptivity, while still handling the class imbalance. In this paper, we propose an ensemble algorithm for imbalanced data streams based on the offline balanced random forest idea. We also show on a recent dataset that the algorithm is useful for the buyer prediction problem in large-scale recommender systems.
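As a rough illustration of how a balanced ensemble could be maintained incrementally, the sketch below combines two well-known ideas: offline balanced random forests draw equally many examples from each class for every tree, and online bagging replaces bootstrap resampling with a per-example Poisson weight. It is not the algorithm proposed in the paper, whose details the abstract does not give. The class-dependent Poisson rate, the nearest-centroid base learner (a stand-in for an incremental tree), and all names such as `BalancedOnlineEnsemble` are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


class CentroidLearner:
    """Hypothetical base learner: weighted running class centroids
    computed on a fixed random subset of the features."""

    def __init__(self, features):
        self.features = features
        self.sums = {}    # label -> weighted per-feature sums
        self.counts = {}  # label -> total weight seen

    def learn(self, x, y, weight):
        if weight == 0:
            return
        s = self.sums.setdefault(y, [0.0] * len(self.features))
        for i, f in enumerate(self.features):
            s[i] += weight * x[f]
        self.counts[y] = self.counts.get(y, 0) + weight

    def predict(self, x):
        best, best_dist = None, float("inf")
        for y, s in self.sums.items():
            n = self.counts[y]
            dist = sum((x[f] - s[i] / n) ** 2
                       for i, f in enumerate(self.features))
            if dist < best_dist:
                best, best_dist = y, dist
        return best


class BalancedOnlineEnsemble:
    """Assumed scheme: each member sees an example k ~ Poisson(lam_y) times,
    where lam_y = (count of rarest class so far) / (count of class y so far).
    The rarest class keeps lam = 1; frequent classes are downsampled."""

    def __init__(self, n_features, n_members=10, subset_size=None, seed=0):
        self.rng = random.Random(seed)
        k = subset_size or max(1, int(math.sqrt(n_features)))
        self.members = [
            CentroidLearner(self.rng.sample(range(n_features), k))
            for _ in range(n_members)
        ]
        self.class_counts = Counter()

    def learn_one(self, x, y):
        self.class_counts[y] += 1
        rarest = min(self.class_counts.values())
        lam = rarest / self.class_counts[y]
        for member in self.members:
            member.learn(x, y, poisson(lam, self.rng))

    def predict_one(self, x):
        votes = Counter(m.predict(x) for m in self.members)
        votes.pop(None, None)  # ignore members that have seen no data yet
        return votes.most_common(1)[0][0] if votes else None
```

Each arriving example is passed once to `learn_one(x, y)` and can then be discarded, so memory does not grow with the length of the stream. Until the first minority example arrives, the running class counts contain a single class and no downsampling takes place; a practical implementation would likely need a warm-up rule or prior class counts.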
