Paper Information - Leveraging Data Resources for Cross-Linguistic Information Retrieval Using Statistical Machine Translation


Retail websites can offer customers a localized user experience by letting them use a preferred secondary language. Automatic translation of user search queries is a crucial component of this experience. We trained several domain-adapted statistical machine translation (SMT) systems for search query translation, including systems for language pairs with smaller-than-desired parallel resources, such as Polish-German and Chinese-Japanese. We explored several techniques for optimizing MT systems for this use case: specialized forms of pre-processing, such as diacritic normalization and a weak form of language filtering; byte-pair encoding (BPE) for automatic word segmentation; sampling monolingual query data to train a language model (LM); and pivoting. To help measure the impact of these techniques, we also introduced normalized discounted cumulative gain for machine translation (NDCG-MT), a measure of how well the MT system serves the downstream information retrieval task. In addition to examining how close a translation is to a human-generated one, we measured the similarity between the search results returned for reference queries and for machine-translated queries. A further challenge was choosing a representative sample of user search queries to use as tuning and test data. The most popular search queries occur far more frequently than the rest and may consist of vocabulary that is already well covered by the training data. Consequently, we also discuss techniques for selecting tune/test data. In general, we suggest assessing MT performance on both “head queries,” those that occur most frequently, and “tail queries,” less frequent queries that can be used to evaluate performance on difficult inputs.
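The abstract does not give the exact definition of NDCG-MT, so the following is only a minimal sketch of one way such a score could be computed, not the authors' formulation. It assumes that the results returned for the human reference query serve as the ideal ranking, that each result is identified by a unique ID, and that relevance decreases linearly with reference rank. The function names `dcg` and `ndcg_mt`, the cutoff `k`, and the gain scheme are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: sum of gain / log2(rank + 1), ranks starting at 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_mt(reference_results: List[str], mt_results: List[str], k: int = 10) -> float:
    """Compare top-k results for an MT query against those for the reference query.

    Assumption (not from the paper): the reference query's ranking is treated
    as ideal, and the item at reference rank i (0-based) receives gain n - i,
    where n is the number of reference results kept. Result IDs are assumed
    to be unique within each list.
    """
    ref_top = reference_results[:k]
    relevance: Dict[str, float] = {
        item: float(len(ref_top) - i) for i, item in enumerate(ref_top)
    }
    # Items retrieved by the MT query but absent from the reference get gain 0.
    mt_gains = [relevance.get(item, 0.0) for item in mt_results[:k]]
    ideal = dcg([relevance[item] for item in ref_top])
    return dcg(mt_gains) / ideal if ideal > 0 else 0.0


# Hypothetical product IDs returned by a search engine for each query.
reference = ["p1", "p2", "p3"]
same_ranking = ndcg_mt(reference, ["p1", "p2", "p3"], k=3)
reordered = ndcg_mt(reference, ["p2", "p1", "p3"], k=3)
unrelated = ndcg_mt(reference, ["x1", "x2", "x3"], k=3)
```

Under these assumptions, an MT query that retrieves the same results in the same order scores highest, a reordering scores lower, and a query that retrieves none of the reference results scores zero, regardless of how closely the translated text matches the reference string.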
