Exploiting associations between word clusters and document classes for cross-domain text categorization

Cross-domain text categorization targets on adapting the knowledge learnt from a labeled source domain to an unlabeled target domain, where the documents from the source and target domains are drawn from different distributions. However, in spite of the different distributions in raw-word features, the associations between word clusters (conceptual features) and document classes may remain stable across different domains. In this paper, we exploit these unchanged associations as the bridge of knowledge transformation from the source domain to the target domain by the non-negative matrix tri-factorization. Specifically, we formulate a joint optimization framework of the two matrix tri-factorizations for the source- and target-domain data, respectively, in which the associations between word clusters and document classes are shared between them. Then, we give an iterative algorithm for this optimization and theoretically show its convergence. The comprehensive experiments show the effectiveness of this method. In particular, we show that the proposed method can deal with some difficult scenarios where baseline methods usually do not perform well. © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 100–114, 2011 (This is an invited submission from the Best of SDM 2010.)

[1]  Chris H. Q. Ding,et al.  Knowledge transformation for cross-domain sentiment classification , 2009, SIGIR.

[2]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[3]  Dacheng Tao,et al.  Evolutionary Cross-Domain Discriminative Hessian Eigenmaps , 2010, IEEE Transactions on Image Processing.

[4]  Yizhou Sun,et al.  Heterogeneous source consensus learning via decision propagation and negotiation , 2009, KDD.

[5]  ChengXiang Zhai,et al.  Instance Weighting for Domain Adaptation in NLP , 2007, ACL.

[6]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[7]  Qiang Yang,et al.  Translated Learning: Transfer Learning across Different Feature Spaces , 2008, NIPS.

[8]  Chris H. Q. Ding,et al.  Bridging Domains with Words: Opinion Analysis with Matrix Tri-factorizations , 2010, SDM.

[9]  Éric Gaussier,et al.  Relation between PLSA and NMF and implications , 2005, SIGIR '05.

[10]  Chris H. Q. Ding,et al.  Knowledge transformation from word space to document space , 2008, SIGIR '08.

[11]  Qiang Yang,et al.  Co-clustering based classification for out-of-domain documents , 2007, KDD '07.

[12]  Daniel D. Lee,et al.  Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines , 2002, NIPS.

[13]  Chris H. Q. Ding,et al.  Orthogonal nonnegative matrix t-factorizations for clustering , 2006, KDD '06.

[14]  Fei Wang,et al.  Semi-Supervised Clustering via Matrix Factorization , 2008, SDM.

[15]  ChengXiang Zhai,et al.  A two-stage approach to domain adaptation for statistical classifiers , 2007, CIKM '07.

[16]  D. Hosmer,et al.  Applied Logistic Regression , 1991 .

[17]  Jordi Vitrià,et al.  Non-negative Matrix Factorization for Face Recognition , 2002, CCIA.

[18]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[19]  Bernt Schiele,et al.  Introducing a weighted non-negative matrix factorization for image classification , 2003, Pattern Recognit. Lett..

[20]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[21]  ChengXiang Zhai,et al.  Domain Adaptation in Natural Language Processing , 2008 .

[22]  Jiawei Han,et al.  Knowledge transfer via multiple model local structure mapping , 2008, KDD.

[23]  Hui Xiong,et al.  Exploiting Associations between Word Clusters and Document Classes for Cross-Domain Text Categorization , 2010, SDM.

[24]  Hui Xiong,et al.  Cross-Domain Learning from Multiple Sources: A Consensus Regularization Perspective , 2010, IEEE Transactions on Knowledge and Data Engineering.

[25]  Dacheng Tao,et al.  Bregman Divergence-Based Regularization for Transfer Subspace Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[26]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[27]  H. Sebastian Seung,et al.  Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[28]  Qiang Yang,et al.  Boosting for transfer learning , 2007, ICML '07.

[29]  Hui Xiong,et al.  Transfer learning from multiple source domains via consensus regularization , 2008, CIKM '08.

[30]  Qiang Yang,et al.  Transfer Learning via Dimensionality Reduction , 2008, AAAI.

[31]  Qiang Yang,et al.  Can Movies and Books Collaborate? Cross-Domain Collaborative Filtering for Sparsity Reduction , 2009, IJCAI.