CLGVSM: Adapting Generalized Vector Space Model to Cross-lingual Document Clustering

Cross-lingual document clustering (CLDC) is the task of automatically organizing a large collection of documents written in different languages into groups according to content or topic. In contrast to the traditional hard-matching strategy, this paper extends the traditional generalized vector space model (GVSM) to handle cross-lingual cases by incorporating cross-lingual word similarity measures; we refer to the extended model as CLGVSM. With this model, we further compare different word similarity measures in cross-lingual document clustering. To select cross-lingual features effectively, we also propose a soft-matching-based feature selection method for CLGVSM. Experimental results on a benchmark data set show that (1) the proposed CLGVSM is very effective for cross-lingual document clustering, significantly outperforming two strong baselines, the vector space model (VSM) and latent semantic analysis (LSA); and (2) the new feature selection method further improves CLGVSM.
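The contrast between hard matching and GVSM-style soft matching can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy similarity table, the romanized placeholder tokens, and the term weights are all assumptions made for illustration. The sketch scores two documents by summing the products of their term weights, weighted by a word similarity function, and normalizes the result in the same way as cosine similarity.

```python
import math

def soft_similarity(doc_a, doc_b, word_sim):
    """GVSM-style similarity between two bag-of-words documents.

    doc_a, doc_b: dicts mapping term -> weight.
    word_sim: function (term, term) -> similarity in [0, 1].
    """
    def bilinear(x, y):
        # sum over all term pairs of weight_x * weight_y * sim(term_x, term_y)
        return sum(wx * wy * word_sim(tx, ty)
                   for tx, wx in x.items()
                   for ty, wy in y.items())

    denom = math.sqrt(bilinear(doc_a, doc_a) * bilinear(doc_b, doc_b))
    return bilinear(doc_a, doc_b) / denom if denom > 0 else 0.0

# Hypothetical cross-lingual similarity table (values are made up).
# "yinhang" and "qian" stand in for Chinese words meaning bank and money.
TOY_SIM = {("bank", "yinhang"): 0.9, ("money", "qian"): 0.8}

def toy_word_sim(u, v):
    """Soft matching: identical terms score 1, listed pairs use the table."""
    if u == v:
        return 1.0
    return TOY_SIM.get((u, v), TOY_SIM.get((v, u), 0.0))

def hard_word_sim(u, v):
    """Hard matching: only identical terms count, as in the plain VSM."""
    return 1.0 if u == v else 0.0

en_doc = {"bank": 2.0, "money": 1.0}
zh_doc = {"yinhang": 1.0, "qian": 1.0}

hard_score = soft_similarity(en_doc, zh_doc, hard_word_sim)
soft_score = soft_similarity(en_doc, zh_doc, toy_word_sim)
print(hard_score, soft_score)
```

With hard matching the two documents share no terms, so their similarity is zero. With the toy similarity table they receive a positive score, which is the effect a cross-lingual word similarity measure is meant to provide.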
