An Unsupervised Text Normalization Architecture for Turkish Language

A variety of applications that operate on short text messages require a text normalization process that transforms ill-formed words into their standard forms. Recently, many successful approaches have been applied to text normalization, especially for social media text. Since each natural language poses its own difficulties, we design an architecture to normalize short text messages in Turkish, which has a morphologically rich, agglutinative structure. The model proceeds from simple solutions towards more complicated and sophisticated ones to reduce time complexity. A variety of techniques, from lexical similarity to n-gram language modeling, are evaluated by exploiting several resources such as a high-quality corpus, a morphological parser, and dictionaries. We demonstrate that an unsupervised text normalization architecture combining lexical and semantic similarity for Turkish yields efficient results that might contribute to other studies.
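The idea of ordering stages from cheap to expensive can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, slang table, corpus, the `max_dist` threshold, and all function names below are hypothetical, and the morphological parser and semantic similarity components mentioned in the abstract are left out. The sketch assumes lowercase input and shows only how a dictionary lookup can short-circuit before edit-distance candidate generation and bigram-based ranking are attempted.

```python
from collections import Counter

# Hypothetical toy resources; a real system would load a large lexicon and corpus.
LEXICON = {"ben", "eve", "okula", "yarın", "geliyorum", "gidiyorum"}
SLANG = {"slm": "selam", "tmm": "tamam"}
CORPUS = [
    ["ben", "eve", "geliyorum"],
    ["ben", "okula", "gidiyorum"],
    ["yarın", "eve", "geliyorum"],
]

# Bigram counts taken from the toy corpus.
BIGRAMS = Counter(
    (sent[i], sent[i + 1]) for sent in CORPUS for i in range(len(sent) - 1)
)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalize_token(token, prev_token=None, max_dist=2):
    """Try the cheapest stage first and stop at the first stage that succeeds."""
    # Stage 1: the token is already a standard word.
    if token in LEXICON:
        return token
    # Stage 2: direct lookup in a slang/abbreviation table.
    if token in SLANG:
        return SLANG[token]
    # Stage 3: lexical similarity, keeping lexicon words within max_dist edits.
    candidates = [(edit_distance(token, w), w) for w in LEXICON]
    candidates = [(d, w) for d, w in candidates if d <= max_dist]
    if not candidates:
        return token  # nothing plausible found; leave the token unchanged
    # Stage 4: prefer the smallest distance, then the highest bigram count
    # with the previous normalized word, then alphabetical order.
    best = min(candidates, key=lambda dw: (dw[0], -BIGRAMS[(prev_token, dw[1])], dw[1]))
    return best[1]


def normalize(text):
    """Normalize a whitespace-tokenized, lowercase message left to right."""
    out = []
    for tok in text.split():
        out.append(normalize_token(tok, out[-1] if out else None))
    return out


print(normalize("ben okula gediyorum"))
```

In this toy setup, `gediyorum` is one edit away from both `geliyorum` and `gidiyorum`, so the choice falls to the bigram counts with the preceding word. Tokens already in the lexicon never reach the more expensive stages, which is the point of the cascade.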
