Cross-language information retrieval by reduced k-means

Cross-language information retrieval aims to retrieve relevant documents in one language for a query in another language. We propose a new approach to cross-language information retrieval based on factorizing a term-document matrix with Reduced k-means clustering, an iterative method that simultaneously reduces objects (documents) and variables (index terms). The proposed method is compared with standard machine learning techniques for cross-language information retrieval, namely latent semantic indexing and canonical correlation analysis. The motivation for using Reduced k-means comes from the observation that documents in the semantic space obtained by latent semantic indexing cluster primarily by language rather than by topic. Because Reduced k-means aims to preserve the clustering structure of the data, the idea is that the proposed method could address this problem.
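As a rough illustration of the factorization the abstract refers to, the following is a minimal sketch of generic Reduced k-means by alternating least squares. It is not the authors' implementation or their cross-language pipeline. It assumes documents are stored as rows of `X`, and the function name `reduced_kmeans` and its parameters are hypothetical. The model approximates `X` by `U @ F @ A.T`, where `U` is a binary cluster-membership matrix, `F` holds cluster centroids in the reduced space, and `A` is a loading matrix with orthonormal columns.

```python
import numpy as np


def reduced_kmeans(X, n_clusters, n_dims, n_iter=50, seed=0):
    """Sketch of Reduced k-means: approximate X by U @ F @ A.T.

    X is (documents x terms); this orientation is an assumption of the sketch.
    Requires n_dims <= min(n_clusters, number of terms).
    Returns cluster labels, loadings A (terms x n_dims), centroids F.
    """
    rng = np.random.default_rng(seed)
    n_docs = X.shape[0]
    labels = rng.integers(n_clusters, size=n_docs)  # random initial partition

    for _ in range(n_iter):
        # Binary membership matrix U (documents x clusters).
        U = np.eye(n_clusters)[labels]
        counts = U.sum(axis=0)
        safe_counts = np.where(counts == 0, 1.0, counts)  # guard empty clusters
        means = (U.T @ X) / safe_counts[:, None]          # cluster means (k x terms)

        # Loadings: leading eigenvectors of X^T P_U X, where P_U projects onto
        # the cluster-indicator columns. X^T P_U X equals M^T M for the
        # count-weighted means M, so the right singular vectors of M suffice.
        M = np.sqrt(counts)[:, None] * means
        _, _, Vt = np.linalg.svd(M, full_matrices=False)
        A = Vt[:n_dims].T                                 # terms x n_dims
        F = means @ A                                     # centroids in reduced space

        # Reassign each document to the nearest centroid in the reduced space.
        Y = X @ A
        dists = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                         # partition is stable
        labels = new_labels

    return labels, A, F


if __name__ == "__main__":
    # Toy data only: 6 "documents" over 5 "terms", not taken from the paper.
    toy = np.random.default_rng(1).random((6, 5))
    labels, A, F = reduced_kmeans(toy, n_clusters=2, n_dims=2)
    print(labels, A.shape, F.shape)
```

Because the part of each document orthogonal to the columns of `A` is the same for every candidate cluster, assigning documents in the projected space `X @ A` gives the same partition as assigning them against the full reconstruction. Like ordinary k-means, the procedure can stop in a local optimum, so several random starts are usually advisable.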

Dunja Mladenic | Jasminka Dobša | Danijel Radošević | Jan Rupnik | Ivan Magdalenić
