Paper information - SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are competitive and mostly superior to traditional statistical aligners, even in scenarios with abundant parallel data. For example, for a set of 100k parallel sentences, contextualized embeddings achieve a word alignment F1 for English-German that is more than 5% higher (absolute) than eflomal, a high quality alignment model.
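The abstract does not spell out how embeddings are turned into alignments. As a minimal sketch of the general idea, and assuming one simple extraction rule (keep a word pair only when each word is the other's nearest neighbour under cosine similarity), alignment from embeddings could look like the following. The function names and the toy 2-dimensional vectors are hypothetical and chosen for illustration only; real multilingual embeddings would come from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mutual_argmax_align(src_vecs, tgt_vecs):
    """Return (i, j) pairs where source word i and target word j
    are each other's most similar word in the sentence pair.

    Assumption: this mutual-nearest-neighbour rule is an illustrative
    choice, not necessarily the method used in the paper.
    """
    if not src_vecs or not tgt_vecs:
        return []
    # Similarity matrix: rows are source words, columns are target words.
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target index for every source word.
    best_tgt = [max(range(len(tgt_vecs)), key=row.__getitem__) for row in sim]
    # Best source index for every target word.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sim[i][j])
        for j in range(len(tgt_vecs))
    ]
    # Keep only pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Hypothetical toy embeddings for a three-word sentence pair.
src = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
tgt = [[0.1, 0.9], [0.9, 0.1], [0.6, 0.8]]
alignment = mutual_argmax_align(src, tgt)
print(alignment)
```

Requiring agreement in both directions trades recall for precision: a word whose nearest neighbour prefers a different partner is simply left unaligned.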

Hinrich Schütze | Philipp Dufter | Masoud Jalili Sabet
