Cross-language information retrieval by reduced k-means

Cross-language information retrieval aims to retrieve relevant documents in one language for a query posed in another language. We propose a new approach to this problem based on factorization of a term-document matrix by the iterative method of Reduced k-means clustering. Reduced k-means simultaneously reduces the objects (documents) and the variables (index terms). The proposed method is compared with standard machine learning techniques for cross-language information retrieval, namely latent semantic indexing and canonical correlation analysis. The motivation for using Reduced k-means for this task comes from the observation that documents in a semantic space obtained by latent semantic indexing cluster primarily by language rather than by topic. Because Reduced k-means aims to preserve the clustering structure of the data, the idea is that the proposed method could address this problem.
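
To make the simultaneous reduction of documents and index terms concrete, the following is a minimal sketch of a generic Reduced k-means iteration in its usual alternating least-squares form, which minimizes ||X - U F A^T||^2 over a binary membership matrix U, low-dimensional centroids F, and column-orthonormal loadings A. It is not the authors' implementation. The function name `reduced_kmeans`, its parameters, the random initialization, the column centering, and the handling of empty clusters are all assumptions made for illustration.

```python
import numpy as np

def reduced_kmeans(X, n_clusters, n_dims, n_iter=50, seed=0):
    """Illustrative Reduced k-means: minimise ||X - U F A^T||^2.

    X is an (n_docs, n_terms) matrix with documents as rows.
    Returns labels (n_docs,), centroids F (n_clusters, n_dims) and
    loadings A (n_terms, n_dims) with orthonormal columns.
    Requires n_dims <= min(n_clusters, n_terms).
    """
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)              # assumption: column-centred data
    n = X.shape[0]
    labels = rng.integers(n_clusters, size=n)   # random start (assumption)
    for _ in range(n_iter):
        # Re-seed any empty cluster with a member of the largest cluster.
        for k in range(n_clusters):
            if not np.any(labels == k):
                largest = np.bincount(labels, minlength=n_clusters).argmax()
                labels[rng.choice(np.flatnonzero(labels == largest))] = k
        U = np.eye(n_clusters)[labels]              # (n, K) binary membership
        counts = U.sum(axis=0)
        centroids = (U.T @ X) / counts[:, None]     # full-space cluster means
        # Loadings: leading eigenvectors of X^T U (U^T U)^{-1} U^T X, obtained
        # as right singular vectors of the size-weighted centroid matrix.
        _, _, Vt = np.linalg.svd(np.sqrt(counts)[:, None] * centroids,
                                 full_matrices=False)
        A = Vt[:n_dims].T
        F = centroids @ A                           # centroids in reduced space
        # Reassign each document to its nearest centroid in the reduced space.
        Y = X @ A
        d2 = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, F, A
```

Because the result depends on the starting partition, a sketch like this would normally be run from several random seeds, keeping the solution with the lowest loss. How the term-document matrix is built from the two languages is left open here.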
