Inducing Word Senses for Cross-lingual Document Clustering

Cross-lingual document clustering is the task of automatically organizing a large collection of documents written in different languages into a few groups according to their content or topic. The language barrier and translation ambiguity are two well-known challenges for cross-lingual document representation. To address them, we propose to represent cross-lingual documents through statistical word senses, which are learned from a parallel corpus by means of a novel cross-lingual word sense induction model. Furthermore, a sense clustering method is adopted to discover semantic relations among word senses, and the resulting sense clusters are used to represent cross-lingual documents in a sense-based vector space model. Evaluation on a benchmark dataset shows that the proposed model outperforms two state-of-the-art models in cross-lingual document clustering.
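To make the idea of a sense-based vector space model concrete, the following minimal Python sketch maps words from two languages onto shared sense-cluster dimensions and compares documents by cosine similarity. It is an illustration under stated assumptions, not the model proposed here: the `SENSE_CLUSTER` table, the cluster names, and the English/Spanish toy vocabulary are all hypothetical, and the sketch assigns one fixed sense per word, whereas an induced-sense model would choose a sense from the word's context.

```python
from collections import Counter
from math import sqrt

# Hypothetical lookup from (language, word) to a sense-cluster id.
# In practice this would come from sense induction plus sense clustering;
# here it is hard-coded purely for illustration.
SENSE_CLUSTER = {
    ("en", "bank"): "C_FINANCE",
    ("en", "loan"): "C_FINANCE",
    ("es", "banco"): "C_FINANCE",
    ("es", "préstamo"): "C_FINANCE",
    ("en", "river"): "C_NATURE",
    ("es", "río"): "C_NATURE",
}


def sense_vector(tokens, lang):
    """Count sense-cluster occurrences for one document; unknown words are skipped."""
    return Counter(
        SENSE_CLUSTER[(lang, tok)] for tok in tokens if (lang, tok) in SENSE_CLUSTER
    )


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


en_doc = sense_vector(["bank", "loan"], "en")
es_finance = sense_vector(["banco", "préstamo"], "es")
es_nature = sense_vector(["río"], "es")

sim_same_topic = cosine(en_doc, es_finance)
sim_diff_topic = cosine(en_doc, es_nature)
```

Because both languages share the same sense-cluster dimensions, documents on the same topic can be compared directly without translating either one; any standard clustering algorithm can then operate on these vectors.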
