Word Embedding based Semantic Cross-Lingual Document Alignment in Comparable Corpora

Cross-lingual information retrieval (CLIR) can be applied to align documents across comparable corpora. However, because of its term independence assumption, traditional CLIR cannot account for the semantic similarity between the constituent words of candidate document pairs in two different languages. Moreover, traditional CLIR models score a document by aggregating only the weights of those terms that match terms of the query, so the non-matching terms of the document contribute little to the similarity function. Word vector embeddings make it possible to model the semantic distance between terms by applying standard distance metrics to their corresponding real-valued vectors. This paper develops a word vector embedding based CLIR model that ranks candidate document pairs by the average distance between the embedded word vectors of the source-language and target-language documents. Our experiments on the WMT bilingual document alignment dataset show that word vector based similarity significantly improves the recall of cross-lingual document alignment compared with classical language modeling based CLIR.
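As an illustration of the ranking idea described above, the following minimal sketch scores each candidate target document by the mean of all pairwise distances between its word vectors and those of the source document. It is not the paper's implementation: the choice of cosine distance, the assumption that both languages already share one embedding space, the toy two-dimensional vectors, and the names `EMB`, `avg_distance`, and `rank_candidates` are all hypothetical.

```python
import math

# Toy 2-d vectors in a hypothetical shared bilingual embedding space.
# Real systems would load trained bilingual embeddings instead.
EMB = {
    "dog": (1.0, 0.0), "cat": (0.8, 0.6),      # source language
    "hund": (1.0, 0.0), "katze": (0.8, 0.6),   # target language
    "bank": (0.0, 1.0), "geld": (0.0, 1.0),
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def avg_distance(src_tokens, tgt_tokens, emb):
    """Mean pairwise distance between the word vectors of two documents."""
    # Out-of-vocabulary tokens are skipped (an assumption of this sketch).
    src = [emb[t] for t in src_tokens if t in emb]
    tgt = [emb[t] for t in tgt_tokens if t in emb]
    if not src or not tgt:
        return float("inf")  # nothing to compare, so rank last
    total = sum(cosine_distance(u, v) for u in src for v in tgt)
    return total / (len(src) * len(tgt))


def rank_candidates(src_tokens, candidates, emb):
    """Return candidate document ids ordered from smallest to largest distance."""
    scored = [(avg_distance(src_tokens, toks, emb), doc_id)
              for doc_id, toks in candidates.items()]
    return [doc_id for _, doc_id in sorted(scored)]


if __name__ == "__main__":
    candidates = {"de1": ["hund", "katze"], "de2": ["bank", "geld"]}
    print(rank_candidates(["dog", "cat"], candidates, EMB))
```

In this toy setup no source and target words match as strings, yet the candidate about animals is ranked ahead of the one about money because its vectors lie closer on average. This is the kind of semantic signal that term matching alone cannot provide.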
