Knowledge Transfer across Multilingual Corpora via Latent Topics

This paper explores bridging the content of two different languages via latent topics. Specifically, we propose a unified probabilistic model that simultaneously learns latent topics from bilingual corpora discussing comparable content, and we use these topics as features in a cross-lingual, dictionary-less text categorization task. Experimental results on multilingual Wikipedia data show that the proposed topic model effectively discovers topic information from the bilingual corpora, and that the learned topics successfully transfer classification knowledge to other languages for which no labeled training data are available.
