Vowel and Diacritic Restoration for Social Media Texts

In this paper, we focus on two important problems of social media text normalization: vowel restoration and diacritic restoration. For both problems, we propose a hybrid model consisting of a discriminative sequence classifier followed by a language validator, which selects one of the morphologically valid outputs of the first stage. The proposed model is language independent and requires no manual annotation of the training data. We measured its performance both on synthetic data produced specifically for these two problems and on real social media data. On synthetic data, our model reaches 97.06% on diacritization of Turkish, improving the state-of-the-art results by 3.65 percentage points on ambiguous cases, and improves vowel restoration by 45.77 percentage points over a rule-based baseline, reaching 62.66% accuracy. On real data, the results are 95.43% and 69.56%, respectively.
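The two-stage idea can be illustrated with a minimal Python sketch for the diacritic case. Everything in it is an assumption made for illustration and is not taken from the paper: the letter table `DIACRITIC_OPTIONS`, the helper names `candidates` and `restore`, the toy scoring function standing in for the discriminative sequence classifier, and the tiny word set standing in for the morphological validator.

```python
from itertools import product

# Assumed mapping: ASCII letters that may stand for a Turkish diacritic letter.
DIACRITIC_OPTIONS = {
    "c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü",
}

# Toy stand-in for a morphological validator (hypothetical word list).
TOY_LEXICON = {"güzel", "şu", "su"}


def candidates(word):
    """Enumerate every diacritized variant of a lowercase ASCII word."""
    options = [DIACRITIC_OPTIONS.get(ch, ch) for ch in word]
    return ["".join(chars) for chars in product(*options)]


def toy_score(word):
    """Toy stand-in for the first-stage classifier: prefer fewer diacritics."""
    return -sum(ch in "çğıöşü" for ch in word)


def restore(word, score, is_valid):
    """Rank candidates by first-stage score and return the best valid one."""
    ranked = sorted(candidates(word), key=score, reverse=True)
    for cand in ranked:
        if is_valid(cand):
            return cand
    # No candidate passed validation: fall back to the top-scored candidate.
    return ranked[0]


restored = restore("guzel", toy_score, TOY_LEXICON.__contains__)
```

In this sketch the validator only filters: the ranking comes entirely from the first stage, and the validator picks the highest-ranked candidate it accepts. A real system would replace `toy_score` with a trained sequence model and `TOY_LEXICON` with a morphological analyzer.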
