Paper Information - Word Embedding based Semantic Cross-Lingual Document Alignment in Comparable Corpora

Word Embedding based Semantic Cross-Lingual Document Alignment in Comparable Corpora

Cross-lingual information retrieval (CLIR) can be applied to align documents across comparable corpora. However, because of the term independence assumption, traditional CLIR cannot account for the semantic similarity between the constituent words of candidate document pairs in two different languages. Moreover, traditional CLIR models score a document by aggregating only the weights of its terms that match those of the query; the non-matching terms of the document do not contribute significantly to the similarity function. Word vector embeddings make it possible to model the semantic distances between terms by applying standard distance metrics to their corresponding real-valued vectors. This paper develops a word vector embedding based CLIR model that ranks candidate document pairs by the average distance between the embedded word vectors of the source and target language documents. Our experiments with the WMT bilingual document alignment dataset show that the word vector based similarity significantly improves the recall of cross-lingual document alignment compared to classical language modeling based CLIR.
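
The ranking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact scoring formula, so the sketch assumes cosine similarity averaged over all source-target word pairs, and it assumes both languages already share one embedding space. The function names (`cosine`, `avg_pairwise_similarity`, `rank_candidates`) and the two-dimensional toy vectors in `EMB` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def avg_pairwise_similarity(src_tokens, tgt_tokens, embeddings):
    """Average cosine similarity over all (source word, target word) pairs.

    Assumption: `embeddings` maps words of BOTH languages into one shared
    vector space. Words without a vector are skipped.
    """
    src_vecs = [embeddings[t] for t in src_tokens if t in embeddings]
    tgt_vecs = [embeddings[t] for t in tgt_tokens if t in embeddings]
    if not src_vecs or not tgt_vecs:
        return 0.0
    total = sum(cosine(u, v) for u in src_vecs for v in tgt_vecs)
    return total / (len(src_vecs) * len(tgt_vecs))


def rank_candidates(src_tokens, candidates, embeddings):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    scored = [
        (doc_id, avg_pairwise_similarity(src_tokens, tokens, embeddings))
        for doc_id, tokens in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors: animal words point one way, finance words another.
EMB = {
    "dog": [0.9, 0.1], "chien": [0.85, 0.15],
    "cat": [0.8, 0.3], "chat": [0.75, 0.35],
    "bank": [0.1, 0.9], "banque": [0.15, 0.85],
    "money": [0.2, 0.8], "argent": [0.25, 0.75],
}

source_doc = ["dog", "cat"]
candidates = {
    "fr_1": ["chien", "chat"],
    "fr_2": ["banque", "argent"],
}
ranking = rank_candidates(source_doc, candidates, EMB)
print(ranking)
```

Because every word of the candidate document enters the average, non-matching terms also influence the score, which is the contrast with term-matching CLIR that the abstract draws. A real system would replace the toy dictionary with trained bilingual embeddings.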

Dwaipayan Roy | Debasis Ganguly | Haithem Afli

[1] Jimmy J. Lin, et al. No Free Lunch: Brute Force vs. Locality-Sensitive Hashing for Cross-lingual Pairwise Similarity, 2011, SIGIR '11.

[2] Patrice Bellot, et al. Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?, 2013, ACL.

[3] Christopher C. Yang, et al. Automatic construction of English/Chinese parallel corpora, 2003, J. Assoc. Inf. Sci. Technol.

[4] Yiu-Chang Lin, et al. YODA System for WMT16 Shared Task: Bilingual Document Alignment, 2016, WMT.

[5] Guido Zuccon, et al. Integrating and Evaluating Neural Word Embeddings in Information Retrieval, 2015, ADCS.

[6] W. Bruce Croft, et al. Embedding-based Query Language Models, 2016, ICTIR.

[7] Ondrej Bojar, et al. Using Term Position Similarity and Language Modeling for Bilingual Document Alignment, 2016, WMT.

[8] John D. Lafferty, et al. Information retrieval as statistical translation, 1999, SIGIR '99.

[9] W. Bruce Croft, et al. A Language Modeling Approach to Information Retrieval, 1998, SIGIR Forum.

[10] Jimmy J. Lin, et al. Combining Statistical Translation Techniques for Cross-Language Information Retrieval, 2012, COLING.

[11] W. Bruce Croft, et al. Automatic query generation for patent search, 2009, CIKM.