An ensemble-based approach to fast classification of multi-label data streams

Network operators are continuously confronted with online events, such as online messages and blog updates. Due to the huge volume of these events and the rapid shifts in their topics, it is critical to manage them promptly and effectively. Many software systems and algorithms have been developed to classify such stream data automatically. Conventional approaches focus on single-label scenarios, where each event is tagged with exactly one label. In many data streams, however, an event can carry more than one label. An effective stream classification system therefore has to account for the distinctive properties of multi-label stream data: large data volumes, label correlations, and concept drift. To address these challenges, we propose an efficient and effective method for multi-label stream classification based on an ensemble of fading random trees. The proposed model can efficiently process high-speed multi-label stream data under concept drift. Empirical studies on real-world tasks demonstrate that our method maintains high accuracy in multi-label stream classification while providing a very efficient solution to the task.
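The abstract gives no implementation details, so the following is only a minimal Python sketch of the general idea it describes: an ensemble of randomly structured trees whose leaf statistics decay over time, so that older examples fade out and the model adapts to concept drift, with per-label counts supporting multi-label prediction. All names and parameters here (FadingRandomTree, FadingRandomTreeEnsemble, fading_factor, n_trees, threshold) are illustrative assumptions, not the authors' implementation, and features are assumed to be numeric values in [0, 1].

```python
import random
from collections import defaultdict

class FadingRandomTree:
    """A randomly structured tree whose leaf label counts decay over time,
    so recent examples dominate and old concepts fade out (a sketch, not the
    authors' algorithm)."""
    def __init__(self, n_features, max_depth=5, fading_factor=0.997, seed=0):
        self.rng = random.Random(seed)
        self.fading_factor = fading_factor
        # Random splits are fixed up front: (feature index, threshold) per depth.
        # Thresholds assume feature values lie in [0, 1].
        self.splits = [(self.rng.randrange(n_features), self.rng.random())
                       for _ in range(max_depth)]
        # leaf id -> {"weight": faded example count, "labels": faded per-label counts}
        self.leaves = defaultdict(lambda: {"weight": 0.0, "labels": defaultdict(float)})

    def _leaf_id(self, x):
        # Route the example through the fixed random splits to a leaf index.
        node = 0
        for feat, thr in self.splits:
            node = 2 * node + (1 if x[feat] > thr else 2)
        return node

    def update(self, x, labels):
        # Fade all stored statistics, then add the new example.
        # (A real implementation would apply the decay lazily via timestamps
        # instead of touching every leaf on every update.)
        for leaf in self.leaves.values():
            leaf["weight"] *= self.fading_factor
            for l in leaf["labels"]:
                leaf["labels"][l] *= self.fading_factor
        leaf = self.leaves[self._leaf_id(x)]
        leaf["weight"] += 1.0
        for l in labels:
            leaf["labels"][l] += 1.0

    def predict_proba(self, x):
        # Per-label relevance estimates from the faded counts at the example's leaf.
        leaf = self.leaves.get(self._leaf_id(x))
        if not leaf or leaf["weight"] == 0.0:
            return {}
        return {l: c / leaf["weight"] for l, c in leaf["labels"].items()}


class FadingRandomTreeEnsemble:
    """Averages per-label probabilities over several fading random trees and
    thresholds them to produce a predicted label set."""
    def __init__(self, n_features, n_trees=10, threshold=0.5):
        self.trees = [FadingRandomTree(n_features, seed=i) for i in range(n_trees)]
        self.threshold = threshold

    def update(self, x, labels):
        for t in self.trees:
            t.update(x, labels)

    def predict(self, x):
        scores = defaultdict(float)
        for t in self.trees:
            for l, p in t.predict_proba(x).items():
                scores[l] += p / len(self.trees)
        return {l for l, p in scores.items() if p >= self.threshold}
```

In a hypothetical stream loop, each arriving example would first be predicted and then used for an update (the usual test-then-train, or prequential, evaluation for data streams): `predicted = ensemble.predict(x)` followed by `ensemble.update(x, labels)`.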
