Graph Ensemble Boosting for Imbalanced Noisy Graph Stream Classification

Many applications involve stream data with structural dependency, graph representations, and continuously increasing volumes. For these applications, it is very common that their class distributions are imbalanced with minority (or positive) samples being only a small portion of the population, which imposes significant challenges for learning models to accurately identify minority samples. This problem is further complicated with the presence of noise, because they are similar to minority samples and any treatment for the class imbalance may falsely focus on the noise and result in deterioration of accuracy. In this paper, we propose a classification model to tackle imbalanced graph streams with noise. Our method, graph ensemble boosting, employs an ensemble-based framework to partition graph stream into chunks each containing a number of noisy graphs with imbalanced class distributions. For each individual chunk, we propose a boosting algorithm to combine discriminative subgraph pattern selection and model learning as a unified framework for graph classification. To tackle concept drifting in graph streams, an instance level weighting mechanism is used to dynamically adjust the instance weight, through which the boosting framework can emphasize on difficult graph samples. The classifiers built from different graph chunks form an ensemble for graph stream classification. Experiments on real-life imbalanced graph streams demonstrate clear benefits of our boosting design for handling imbalanced noisy graph stream.

[1]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[2]  Phillipp Kaestner,et al.  Linear And Nonlinear Programming , 2016 .

[3]  Jure Leskovec,et al.  Linear Programming Boosting for Uneven Datasets , 2003, ICML.

[4]  Hong Cheng,et al.  Graph classification: a diversified discriminative feature selection approach , 2012, CIKM.

[5]  John N. Tsitsiklis,et al.  Introduction to linear optimization , 1997, Athena scientific optimization and computation series.

[6]  Zhi-Hua Zhou,et al.  Exploratory Under-Sampling for Class-Imbalance Learning , 2006, Sixth International Conference on Data Mining (ICDM'06).

[7]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[8]  Chengqi Zhang,et al.  Multi-Graph Learning with Positive and Unlabeled Bags , 2014, SDM.

[9]  Haibo He,et al.  Towards incremental learning of nonstationary imbalanced data stream: a multiple selectively recursive approach , 2011, Evol. Syst..

[10]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[11]  Philip S. Yu,et al.  Mining concept-drifting data streams using ensemble classifiers , 2003, KDD '03.

[12]  Hong Cheng,et al.  Identifying bug signatures using discriminative graph mining , 2009, ISSTA.

[13]  Shirui Pan,et al.  CGStream: continuous correlated graph query for data streams , 2012, CIKM '12.

[14]  Gregory Ditzler,et al.  Incremental Learning of Concept Drift from Streaming Imbalanced Data , 2013, IEEE Transactions on Knowledge and Data Engineering.

[15]  Chengqi Zhang,et al.  Nested Subtree Hash Kernels for Large-Scale Graph Classification over Streams , 2012, 2012 IEEE 12th International Conference on Data Mining.

[16]  Ambuj K. Singh,et al.  GraphSig: A Scalable Approach to Mining Significant Subgraphs in Large Graph Databases , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[17]  Xiaodong Lin,et al.  Active Learning From Stream Data Using Optimal Weight Classifier Ensemble , 2010, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[18]  A. Bifet,et al.  Early Drift Detection Method , 2005 .

[19]  Hongliang Fei,et al.  Boosting with structure information in the functional space: an application to graph classification , 2010, KDD.

[20]  Francisco Herrera,et al.  A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[21]  Shirui Pan,et al.  Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Graph Classification with Imbalanced Class Distributions and Noise ∗ , 2022 .

[22]  Kaspar Riesen,et al.  Graph Classification by Means of Lipschitz Embedding , 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[23]  Sebastian Nowozin,et al.  gBoost: a mathematical programming approach to graph classification and regression , 2009, Machine Learning.

[24]  D. Luenberger Optimization by Vector Space Methods , 1968 .

[25]  Jie Tang,et al.  ArnetMiner: extraction and mining of academic social networks , 2008, KDD.

[26]  Xin Yao,et al.  A learning framework for online class imbalance learning , 2013, 2013 IEEE Symposium on Computational Intelligence and Ensemble Learning (CIEL).

[27]  Charu C. Aggarwal,et al.  On Classification of Graph Streams , 2011, SDM.

[28]  Stephen Kwek,et al.  Applying Support Vector Machines to Imbalanced Datasets , 2004, ECML.

[29]  H. Kashima,et al.  Kernels for graphs , 2004 .

[30]  Taghi M. Khoshgoftaar,et al.  Knowledge discovery from imbalanced and noisy data , 2009, Data Knowl. Eng..

[31]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[32]  George Karypis,et al.  Frequent substructure-based approaches for classifying chemical compounds , 2003, IEEE Transactions on Knowledge and Data Engineering.

[33]  Yuji Matsumoto,et al.  An Application of Boosting to Graph Classification , 2004, NIPS.

[34]  Philip S. Yu,et al.  Semi-supervised feature selection for graph classification , 2010, KDD.

[35]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[36]  Bin Li,et al.  Fast Graph Stream Classification Using Discriminative Clique Hashing , 2013, PAKDD.

[37]  Philip S. Yu,et al.  Graph stream classification using labeled and unlabeled graphs , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[38]  Yunqian Ma,et al.  Imbalanced Learning: Foundations, Algorithms, and Applications , 2013 .

[39]  Andrew K. C. Wong,et al.  Classification of Imbalanced Data: a Review , 2009, Int. J. Pattern Recognit. Artif. Intell..

[40]  Nello Cristianini,et al.  Controlling the Sensitivity of Support Vector Machines , 1999 .

[41]  William Nick Street,et al.  A streaming ensemble algorithm (SEA) for large-scale classification , 2001, KDD '01.

[42]  Tatsuya Akutsu,et al.  Extensions of marginalized graph kernels , 2004, ICML.

[43]  Philip S. Yu,et al.  A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions , 2007, SDM.