Development of the information system for the Kazakh language preprocessing

Abstract The aim of this work is to design and develop linguistic resources and preprocessing tools for the Kazakh language. The linguistic resource presented is a media corpus of the Kazakh language, which is available on the Al-Farabi Kazakh National University platform. The corpus consists of news texts and is implemented as an information system. The article describes the general architecture of this information system, which supports the automatic and reliable collection, storage, and analysis of Kazakh-language texts. It also presents three automatic text preprocessing tools for Kazakh: a word-form generator, a morphological analyzer, and a morphological disambiguation tool. The proposed tools can also be applied in automatic text analysis systems and in the creation of other linguistic resources, such as thesauri and ontologies.
