Phonetic Normalization for Machine Translation of User Generated Content

We present an approach to normalizing noisy User Generated Content (UGC) in French, with the aim of building a preprocessing pipeline that improves Machine Translation for this kind of non-canonical corpus. To this end, we implemented a character-based neural phonetizer that produces IPA pronunciations of words. With it, we intend to correct the grammar, vocabulary and accentuation errors often present in noisy UGC corpora. Our method leverages the fact that some errors stem from confusion between words with similar pronunciations, which can be corrected by using a phonetic look-up table to produce normalization candidates. These candidate corrections are then encoded in a lattice and ranked using a language model to output the most probable corrected phrase. Compared to using other phonetizers, our method improves the performance of a transformer-based machine translation system on UGC.
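
The candidate generation and ranking steps can be illustrated with a minimal sketch. It is not the implementation described here: the neural phonetizer is replaced by a hypothetical hand-written pronunciation dictionary (PRONUNCIATIONS), the lattice is simplified to one slot of homophone candidates per token, and the language model is a toy add-one-smoothed bigram model estimated from a four-sentence corpus (CORPUS). All names and data below are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for the neural phonetizer: word -> IPA string.
PRONUNCIATIONS = {"sa": "sa", "ça": "sa", "va": "va", "vas": "va", "bien": "bjɛ̃"}

# Phonetic look-up table: IPA string -> all words sharing that pronunciation.
HOMOPHONES = defaultdict(list)
for word, ipa in PRONUNCIATIONS.items():
    HOMOPHONES[ipa].append(word)

# Toy corpus for the bigram language model (assumption, not real data).
CORPUS = ["ça va bien", "il va bien", "sa maison est belle", "tu vas bien"]

BIGRAMS, CONTEXTS, VOCAB = Counter(), Counter(), set()
for sentence in CORPUS:
    tokens = ["<s>"] + sentence.split()
    VOCAB.update(tokens[1:])
    for prev, word in zip(tokens, tokens[1:]):
        BIGRAMS[prev, word] += 1
        CONTEXTS[prev] += 1


def log_prob(prev, word):
    """Add-one-smoothed bigram log-probability of `word` given `prev`."""
    return math.log((BIGRAMS[prev, word] + 1) / (CONTEXTS[prev] + len(VOCAB)))


def candidates(token):
    """Homophones of `token`; unknown tokens are kept unchanged."""
    ipa = PRONUNCIATIONS.get(token)
    return HOMOPHONES[ipa] if ipa is not None else [token]


def normalize(sentence):
    """Pick the best-scoring path through the per-token candidate slots (Viterbi)."""
    best = {"<s>": (0.0, [])}  # last word -> (log-score, path so far)
    for token in sentence.split():
        step = {}
        for word in candidates(token):
            score, path = max(
                ((s + log_prob(prev, word), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[word] = (score, path + [word])
        best = step
    return " ".join(max(best.values(), key=lambda item: item[0])[1])


print(normalize("sa va bien"))
```

On this toy data, the noisy input "sa va bien" is rewritten as "ça va bien", because the bigram "ça va" is attested in the corpus while "sa va" is not. A full lattice would also allow candidates that span several tokens, which this per-token simplification does not cover.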
