Social media text normalization for Turkish

Text normalization is an indispensable stage in processing noncanonical language from natural sources such as speech, social media or short text messages. Research in this field is recent and focuses mostly on English. As is known from other areas of natural language processing, morphologically rich languages (MRLs) pose many challenges that English does not. Turkish is a strong representative of MRLs and has particular normalization problems that may not be easily solved by a single-stage, purely statistical model. This article introduces the first work on social media text normalization of an MRL and presents the first complete social media text normalization system for Turkish. It conducts an in-depth analysis of the error types encountered in Web 2.0 Turkish texts, categorizes them into seven groups and provides a solution for each by dividing the candidate generation task into separate modules working in a cascaded architecture. For the first time in the literature, two manually normalized Web 2.0 datasets are introduced for Turkish normalization studies. The exact match scores of the overall system on the provided datasets are 70.40 per cent and 67.37 per cent (77.07 per cent with a case-insensitive evaluation).
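
The cascaded candidate generation and the exact match scoring mentioned above can be illustrated with a minimal sketch. It is not the article's system: the two modules (`lexicon_lookup`, `collapse_repeats`), the toy lexicon, and the rule that the first module to produce a candidate wins are all assumptions made for illustration, and the seven error categories are not modelled.

```python
import re
from typing import Callable, List, Optional, Sequence

# A module returns a normalized candidate, or None when it does not apply.
Module = Callable[[str], Optional[str]]

# Hypothetical toy lexicon of abbreviations; not taken from the article.
TOY_LEXICON = {"slm": "selam"}


def lexicon_lookup(token: str) -> Optional[str]:
    """Hypothetical module: replace a known abbreviation."""
    return TOY_LEXICON.get(token)


def collapse_repeats(token: str) -> Optional[str]:
    """Hypothetical module: squeeze runs of 3+ identical characters to one."""
    squeezed = re.sub(r"(.)\1{2,}", r"\1", token)
    return squeezed if squeezed != token else None


def normalize(token: str, modules: Sequence[Module]) -> str:
    """Try each module in order; the first one that yields a candidate wins."""
    for module in modules:
        candidate = module(token)
        if candidate is not None:
            return candidate
    return token  # no module applied: leave the token unchanged


def exact_match(predicted: List[str], gold: List[str],
                case_sensitive: bool = True) -> float:
    """Percentage of tokens whose prediction equals the gold normalization."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    if not case_sensitive:
        # str.lower() is locale-independent and does not apply the Turkish
        # dotted/dotless I rules; a real evaluation would need to handle that.
        predicted = [p.lower() for p in predicted]
        gold = [g.lower() for g in gold]
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


tokens = ["slm", "çoooook", "güzel"]
gold = ["Selam", "çok", "güzel"]
predicted = [normalize(t, [lexicon_lookup, collapse_repeats]) for t in tokens]
strict_score = exact_match(predicted, gold)
relaxed_score = exact_match(predicted, gold, case_sensitive=False)
```

On this toy input the only strict mismatch is the capitalization of the first token, which is why the case-insensitive score is higher than the case-sensitive one, mirroring the two figures reported for the second dataset.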
