Normalization of Dutch User-Generated Content

This paper describes a phrase-based machine translation approach to normalizing Dutch user-generated content (UGC). We compiled a corpus covering three social media genres (text messages, message board posts and tweets) to obtain a representative sample of this relatively new domain. We describe the characteristics of this noisy text material and explain how it was manually normalized using newly developed guidelines. For the automatic normalization task we focus on text messages, and find that a cascaded SMT system, in which a token-based module is followed by translation at the character level, yields the largest reduction in word error rate. After these initial experiments, we investigate the system’s robustness across the complete UGC domain by testing it on the other two social media genres, and find that the cascaded approach performs best on these genres as well. To our knowledge, this is the first proof-of-concept system for Dutch UGC normalization, and it can serve as a baseline for future work.
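
As an illustration of two ideas mentioned in the abstract, the sketch below shows one common way to prepare text for character-level translation and to compute word error rate. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `_` word-boundary symbol, and the example sentence are all hypothetical, and the abstract does not specify how the character-level module represents its input or how errors are counted.

```python
def to_char_level(sentence: str, boundary: str = "_") -> str:
    """Rewrite a sentence as space-separated characters.

    Spaces between words become the boundary symbol so that a phrase-based
    system can treat each character as a "token". The "_" symbol is an
    assumption; input that already contains it would not round-trip.
    """
    return " ".join(boundary if ch == " " else ch for ch in sentence.strip())


def from_char_level(chars: str, boundary: str = "_") -> str:
    """Invert to_char_level: join characters and restore word boundaries."""
    return "".join(" " if sym == boundary else sym for sym in chars.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: an empty reference gives 0.0
        # for an empty hypothesis and 1.0 otherwise.
        return 0.0 if not hyp else 1.0
    # Standard Levenshtein dynamic programme, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Invented example: "w8" used as shorthand for "wacht".
noisy = "ik w8 op je"
gold = "ik wacht op je"
char_input = to_char_level(noisy)
restored = from_char_level(char_input)
wer_before = word_error_rate(gold, noisy)
```

Here `wer_before` measures the error rate of the unnormalized text against the gold normalization; comparing it with the error rate of a system's output gives the kind of reduction the abstract refers to.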
