Next word prediction based on the N-gram model for Kurdish Sorani and Kurmanji

Next word prediction is an input technology that simplifies typing by suggesting the next word for the user to select, since typing in a conversation is time-consuming. A few previous studies have focused on the Kurdish language, including next word prediction. However, the lack of a Kurdish text corpus presents a challenge. Moreover, the lack of a sufficient number of N-grams for the Kurdish language, for instance five-grams, is the reason next word prediction is rarely used for Kurdish. Furthermore, several Kurdish letters are displayed improperly in RStudio. This paper provides a Kurdish corpus, creates five-grams, and presents a unique research work on next word prediction for Kurdish Sorani and Kurmanji. Because little work has been conducted on Kurdish next word prediction, the N-gram model is used to suggest text accurately and to reduce typing time in the Kurdish language. The application is built with R programming and RStudio. The model is 96.3% accurate.
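As an illustration of how an N-gram model can suggest the next word, the following is a minimal sketch and not the paper's implementation. The paper's application is written in R; Python is used here only for illustration. The function names `build_ngram_counts` and `predict_next`, the placeholder tokens, and the choice to back off from longer to shorter contexts are all assumptions that the abstract does not confirm.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, max_n=5):
    """Count, for every context of 0..max_n-1 words, which word follows it.

    Hypothetical helper: the paper does not describe its counting procedure.
    """
    counts = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # The first n-1 tokens form the context, the last is the next word.
            *context, word = tokens[i:i + n]
            counts[tuple(context)][word] += 1
    return counts


def predict_next(counts, history, max_n=5, k=3):
    """Suggest up to k next words for the typed history.

    Assumed strategy: try the longest available context first and back off
    to shorter contexts until one has been seen in the corpus.
    """
    longest = min(max_n - 1, len(history))
    for size in range(longest, -1, -1):
        context = tuple(history[len(history) - size:])
        if context in counts:
            return [word for word, _ in counts[context].most_common(k)]
    return []


# Placeholder tokens standing in for a tokenized Kurdish corpus.
tokens = "w1 w2 w3 w1 w2 w4 w1 w2 w3".split()
counts = build_ngram_counts(tokens)

# "w1 w2" is followed by "w3" twice and by "w4" once.
print(predict_next(counts, ["w1", "w2"]))

# An unseen word in the history forces a back-off to the context ("w2",).
print(predict_next(counts, ["zz", "w2"]))
```

In this sketch a five-gram model corresponds to `max_n=5`: the predictor uses at most the last four typed words as context, and a larger corpus makes it more likely that such long contexts have been observed.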
