Paper Information - Phonetic Normalization for Machine Translation of User Generated Content

We present an approach to correcting noisy User Generated Content (UGC) in French, with the aim of building a preprocessing pipeline that improves Machine Translation for this kind of non-canonical corpus. To do so, we have implemented a character-based neural phonetizer that produces IPA pronunciations of words. In this way, we intend to correct the grammar, vocabulary and accentuation errors often present in noisy UGC corpora. Our method leverages the fact that some errors are due to confusion between words with similar pronunciations, which can be corrected using a phonetic look-up table to produce normalization candidates. These potential corrections are then encoded in a lattice and ranked using a language model to output the most probable corrected phrase. Compared to using other phonetizers, our method boosts a transformer-based machine translation system on UGC.
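The candidate-generation and ranking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny `LEXICON` stands in for the neural phonetizer, the bigram scores in `BIGRAM_LOGP` are invented stand-ins for a real language model, and the lattice is simplified to one list of candidates per token. All names and values here are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy lexicon (word -> IPA-like string); the paper uses a
# character-based neural phonetizer instead of a fixed dictionary.
LEXICON = {
    "sa": "sa", "ça": "sa",
    "a": "a", "à": "a",
    "va": "va",
    "bien": "bjɛ̃",
}

# Phonetic look-up table: pronunciation -> all words sharing it.
HOMOPHONES = defaultdict(set)
for word, pron in LEXICON.items():
    HOMOPHONES[pron].add(word)

# Invented bigram log-probabilities standing in for a language model.
BIGRAM_LOGP = {
    ("<s>", "ça"): -1.0, ("<s>", "sa"): -2.0,
    ("ça", "va"): -0.5, ("sa", "va"): -6.0,
    ("va", "bien"): -0.7,
}
UNSEEN_LOGP = -10.0  # assumed flat penalty for unseen bigrams


def candidates(token):
    """Return normalization candidates: the words that sound like `token`."""
    pron = LEXICON.get(token)
    if pron is None:
        return [token]  # unknown token: keep it unchanged
    return sorted(HOMOPHONES[pron])


def logp(prev, word):
    return BIGRAM_LOGP.get((prev, word), UNSEEN_LOGP)


def best_path(lattice):
    """Viterbi search over a lattice given as one candidate list per position."""
    # best[word] = (score, path) for the best path ending in `word`
    best = {"<s>": (0.0, [])}
    for cands in lattice:
        new_best = {}
        for cand in cands:
            score, path = max(
                ((s + logp(prev, cand), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def normalize(sentence):
    lattice = [candidates(tok) for tok in sentence.split()]
    return " ".join(best_path(lattice))


print(normalize("sa va bien"))
```

With these toy scores, the homophone confusion in "sa va bien" is resolved in favour of "ça va bien", because the path through "ça" has the higher total bigram score. A realistic setup would replace the dictionary with a learned phonetizer and the bigram table with a trained language model.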

Guillaume Wisniewski | Djamé Seddah | José Carlos Rosales Núñez
