Multi-Label Regularized Generative Model for Semi-Supervised Collective Classification in Large-Scale Networks

The problem of collective classification (CC) for large-scale network data has received considerable attention over the last decade. CC usually improves accuracy when a fully labeled network with a large amount of labeled data is available. However, such labels can be difficult to obtain, and learning a CC model from only a few labels in a large-scale, sparsely labeled network can lead to poor performance. In this paper, we show that leveraging the unlabeled portion of the data through semi-supervised collective classification (SSCC) is essential to achieving high performance. First, we propose a novel generative algorithm, called the generative model with network regularization (GMNR), that exploits both labeled and unlabeled data in large-scale, sparsely labeled networks. In GMNR, a network regularizer is constructed to encode the network structure, and this regularizer is applied to smooth the probability density functions of the generative model. Second, we extend GMNR to network data consisting of multi-label instances. This approach, called the multi-label regularized generative model (MRGM), adds a label regularizer that encodes label correlations, and we show how both smoothing regularizers can be incorporated into the model's objective function to improve CC performance in the multi-label setting. We then develop an optimization scheme based on the EM algorithm to solve the objective function. Empirical results on several real-world network classification tasks show that our proposed methods outperform competing collective classification algorithms, especially when labeled data is scarce.
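
To make the structure of the described objective concrete, the following is a minimal LaTeX sketch of a regularized objective of the kind the abstract outlines. The notation is assumed for illustration only and is not taken from the paper: the generative component is written in a PLSA style with word counts n(d_i, w_j), latent topics z_k, network edge weights W_{ii'}, label-correlation weights C_{kk'}, and trade-off parameters \lambda and \mu.

\max_{\Theta}\;
\sum_{i}\sum_{j} n(d_i, w_j)\,\log\!\sum_{k} P(w_j \mid z_k)\, P(z_k \mid d_i)
\;-\;\frac{\lambda}{2}\sum_{i,i'} W_{ii'}\sum_{k}\bigl(P(z_k \mid d_i)-P(z_k \mid d_{i'})\bigr)^{2}
\;-\;\frac{\mu}{2}\sum_{k,k'} C_{kk'}\sum_{i}\bigl(P(z_k \mid d_i)-P(z_{k'} \mid d_i)\bigr)^{2}

Under this illustrative reading, the first term is the generative log-likelihood, the second (network regularizer) penalizes linked nodes whose latent distributions differ, and the third (label regularizer) penalizes correlated labels whose per-node probabilities diverge. An EM-based scheme would then alternate a standard E-step over the latent variables with an M-step that maximizes the regularized expected complete-data log-likelihood.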
