Mining Distinction and Commonality across Multiple Domains Using Generative Model for Text Classification

The difference in data distributions across domains has been exploited for cross-domain text categorization in recent years. Along this line, we make two new observations in this study. First, the difference in data distributions often arises because different domains use different index words to express the same concept. Second, the association between a word concept and a document class can be stable across domains. These two observations correspond to the distinction and the commonality across domains, respectively. Inspired by them, we propose a generative statistical model, named Collaborative Dual-PLSA (CD-PLSA), to simultaneously capture both the distinction and the commonality among multiple domains. Unlike Probabilistic Latent Semantic Analysis (PLSA), which has only one latent variable, the proposed model has two latent factors, y and z, corresponding to the word concept and the document class, respectively. The shared commonality intertwines with the distinctions over multiple domains and also serves as the bridge for knowledge transfer. An Expectation-Maximization (EM) algorithm is developed to solve the CD-PLSA model, and a distributed version is further devised to avoid uploading all the raw data to a centralized location, which helps mitigate privacy concerns. After the training phase over all the data from multiple domains, we refine the resulting model outputs using only the corresponding local data. In summary, we propose a two-phase method for cross-domain text classification: the first phase performs collaborative training with all the data, and the second phase performs local refinement. Finally, we conduct extensive experiments over hundreds of classification tasks with multiple source domains and multiple target domains to validate the superiority of the proposed method over state-of-the-art supervised and transfer learning methods. Notably, the experimental results show that CD-PLSA with collaborative training is more tolerant of distribution differences, and that local refinement yields a further significant improvement in classification accuracy.
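To make the model structure concrete, the following is a minimal sketch of a dual-latent-factor decomposition consistent with the description above; the notation (domain index r, word w, document d, word concept y, document class z) and the exact placement of the domain index are illustrative assumptions, not necessarily the paper's verbatim formulation. The word-emission distributions P(w | y, r) and document-membership distributions P(d | z, r) are domain-specific and capture the distinction, while the concept-class association P(y, z) is shared and captures the commonality:

  P(w, d \mid r) = \sum_{y} \sum_{z} P(w \mid y, r) \, P(d \mid z, r) \, P(y, z).

Under this factorization, the E-step of EM computes the posterior over the two latent factors,

  P(y, z \mid w, d, r) = \frac{P(w \mid y, r) \, P(d \mid z, r) \, P(y, z)}{\sum_{y'} \sum_{z'} P(w \mid y', r) \, P(d \mid z', r) \, P(y', z')},

and the M-step re-estimates each factor from expected counts, e.g., for the shared factor,

  P(y, z) \propto \sum_{r} \sum_{w} \sum_{d} n(w, d, r) \, P(y, z \mid w, d, r),

where n(w, d, r) is the frequency of word w in document d of domain r. Because this update pools posteriors from every domain while the updates for P(w | y, r) and P(d | z, r) use only domain r's own counts, the shared P(y, z) is precisely the bridge through which knowledge transfers across domains; it also suggests why a distributed implementation is natural, since each site can run the E-step and its domain-specific M-steps locally and exchange only the sufficient statistics needed to update P(y, z). Once the model is fitted, a target-domain document d can be labeled, under this sketch, with the class z that maximizes P(d \mid z, r) \sum_{y} P(y, z).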
