Classifying networked text data with positive and unlabeled examples

Highlights:
- We present an NMF-based method for PU learning on networked text data.
- Our algorithm integrates feature and network information via a consensus principle.
- Our method handles networked data with extremely limited positive examples.
- We demonstrate the effectiveness of our algorithm.

The rapid growth in the number of networked applications that naturally generate complex text data, which contains not only node features but also inter-dependent relations, has created a demand for efficient classification of such data. Many classification algorithms have been proposed, but they usually require fully labeled text examples as input. In many networked applications, however, labeling text data can be expensive, and hence a large amount of text may remain unlabeled. In this paper we study the problem of classifying networked text data when only positive and unlabeled examples are available. We present a non-negative matrix factorization (NMF)-based approach to networked text classification that jointly factorizes the content matrix of the nodes and the topological network structure, and that incorporates supervised information into the objective function via a consensus principle. We propose a novel learning algorithm, puNet (positive and unlabeled learning algorithm for Networked text data), which efficiently classifies networked text even when training datasets contain only a small number of positive examples and a large number of unlabeled ones. Experiments on benchmark networked datasets illustrate the effectiveness of our algorithm.
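The core idea, a joint NMF that couples the node-content matrix and the network adjacency matrix through a shared factor, can be sketched as follows. This is a minimal illustration, not the authors' puNet algorithm: the objective ||X - WH||^2 + lam * ||A - W W^T||^2 with a shared factor W standing in for the consensus principle, the function name `joint_nmf`, and the multiplicative update rules are all assumptions based on standard NMF practice, and the supervised (positive-example) term is omitted.

```python
import numpy as np

def joint_nmf(X, A, k, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Sketch of a consensus-style joint factorization (not the paper's puNet).

    X : (n_docs, n_features) non-negative content matrix
    A : (n_docs, n_docs) symmetric non-negative adjacency matrix
    k : number of latent factors (e.g. classes/communities)

    Minimizes ||X - W H||^2 + lam * ||A - W W^T||^2 with multiplicative
    updates; the shared factor W forces content and network views to agree.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Standard Lee-Seung style multiplicative updates (non-negativity preserved).
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T + 2 * lam * (A @ W)) / (
            W @ (H @ H.T) + 2 * lam * (W @ (W.T @ W)) + eps
        )
    return W, H
```

Each row of the returned W gives a node's soft membership over the k latent factors, combining evidence from both its text content and its network neighborhood; in a PU setting one would additionally bias W toward the known positive nodes.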
