Inducing Word Senses for Cross-lingual Document Clustering

Cross-lingual document clustering is the task of automatically organizing a large collection of cross-lingual documents into a few groups according to their content or topic. It is well known that the language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To address these issues, we propose to represent cross-lingual documents through statistical word senses, which are learned from a parallel corpus by means of a novel cross-lingual word sense induction model. Furthermore, a sense clustering method is adopted to discover semantic relations among word senses, which are then used to represent cross-lingual documents through a sense-based vector space model. Evaluation on a benchmark dataset shows that the proposed model outperforms two state-of-the-art models in cross-lingual document clustering.
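The idea of a sense-based vector space model can be illustrated with a minimal sketch. It is not the authors' implementation: the sense inventory `WORD_TO_SENSE`, the sense IDs, and the helper names `sense_vector` and `cosine` are all hypothetical, and each word is mapped to a single hand-written sense, whereas the abstract describes senses induced from a parallel corpus. The sketch only shows why documents in different languages become comparable once they share sense dimensions.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy sense inventory (an assumption for illustration only).
# Words from both languages map to shared sense IDs. The abstract describes
# inducing senses from a parallel corpus; here they are written by hand and
# each word has exactly one sense, so no disambiguation takes place.
WORD_TO_SENSE = {
    "bank": "S_FINANCE",
    "银行": "S_FINANCE",
    "loan": "S_FINANCE",
    "贷款": "S_FINANCE",
    "river": "S_NATURE",
    "河": "S_NATURE",
}


def sense_vector(tokens):
    """Count occurrences of each sense; tokens with no known sense are skipped."""
    return Counter(WORD_TO_SENSE[t] for t in tokens if t in WORD_TO_SENSE)


def cosine(u, v):
    """Cosine similarity between two sparse sense-count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


english_doc = ["bank", "loan", "bank"]
chinese_doc = ["银行", "贷款"]
nature_doc = ["river", "river"]

# Documents in different languages land on the same sense dimension.
same_topic = cosine(sense_vector(english_doc), sense_vector(chinese_doc))
# Documents about unrelated senses share no dimension.
different_topic = cosine(sense_vector(english_doc), sense_vector(nature_doc))
```

Any standard clustering algorithm could then be run over such sense vectors. The abstract does not specify one, so none is assumed here.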

Peng Jin | Erik Cambria | Yunqing Xia | Guoyu Tang
