Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering

Cross-lingual document clustering is the task of automatically organizing a large collection of multilingual documents into a small number of clusters according to their content or topic. The language barrier and translation ambiguity are two well-known challenges for cross-lingual document representation. To address them, we propose to represent cross-lingual documents through statistical word senses, which are automatically discovered from a parallel corpus through a novel cross-lingual word sense induction model and a sense clustering method. In particular, the former consists of a sense-based vector space model and the latter leverages a sense-based latent Dirichlet allocation. Evaluation on benchmark datasets shows that the proposed models outperform two state-of-the-art methods for cross-lingual document clustering.
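
The idea of a sense-based vector space can be illustrated with a minimal sketch. It is not the model proposed here: it assumes a tiny hand-written lexicon (the hypothetical `SENSE_OF` table) in place of senses induced from a parallel corpus, and it assigns each word a single sense regardless of context. The sense labels, the function names, and the English and German example words are all illustrative assumptions. The sketch only shows how documents in different languages become comparable once their words are mapped into one shared sense space.

```python
import math
from collections import Counter

# Hypothetical lexicon mapping (language, word) to a shared sense label.
# Assumption: in a real system these senses would be induced from a
# parallel corpus and chosen per occurrence using context.
SENSE_OF = {
    ("en", "bank"): "S_FINANCE",
    ("en", "loan"): "S_CREDIT",
    ("en", "river"): "S_RIVER",
    ("de", "bank"): "S_FINANCE",
    ("de", "kredit"): "S_CREDIT",
    ("de", "fluss"): "S_RIVER",
}


def sense_vector(lang, tokens):
    """Count sense labels for the tokens of one document; unknown words are skipped."""
    senses = [SENSE_OF[(lang, t)] for t in tokens if (lang, t) in SENSE_OF]
    return Counter(senses)


def cosine(u, v):
    """Cosine similarity between two sparse sense-count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


en_doc = sense_vector("en", ["bank", "loan", "bank"])
de_same_topic = sense_vector("de", ["bank", "kredit"])
de_other_topic = sense_vector("de", ["fluss"])

print(cosine(en_doc, de_same_topic))   # high: both documents share finance senses
print(cosine(en_doc, de_other_topic))  # zero: no shared senses
```

Under these assumptions, the English document and the German document on the same topic share sense dimensions even though they share no language-specific vocabulary, so any standard clustering algorithm could be run on the resulting vectors.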
