FarsiSpell: A spell-checking system for Persian using a large monolingual corpus

In recent years, the wide availability of language resources in various forms, together with rapid advances in computer technology and programming, has led researchers in linguistics and computer science to cooperate on problems in computational linguistics and natural language processing. Building large monolingual and bilingual corpora in digital form and storing them in computer memory has enabled linguists and language engineers to explore techniques for processing information automatically with the help of computer programs, without having to collect and analyze data manually. One of the main applications of monolingual corpora is the development of automatic spell-checking systems. In such systems, a large monolingual corpus can serve as the database in place of a monolingual dictionary. The present study attempts to demonstrate the effectiveness of a large monolingual corpus of Persian in improving the output quality of a spell-checker developed for this language. In this spelling correction system, error detection, suggestion generation, and suggestion ranking are performed as three separate stages. An experiment was carried out to evaluate the performance of the spell-checking system.
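The three-stage design described above can be illustrated with a minimal sketch. The abstract does not say how each stage is implemented, so everything here is an assumption made for illustration: the toy English corpus, the Latin alphabet, the one-edit candidate generator, the frequency-based ranking, and the function names `detect`, `suggest`, `rank`, and `check`. A Persian system would use a Persian corpus and character set.

```python
from collections import Counter

# Toy corpus standing in for a large monolingual corpus (hypothetical data).
CORPUS = "the cat sat on the mat the cat ate the rat".split()
FREQ = Counter(CORPUS)  # word -> number of occurrences in the corpus
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: Latin letters only


def detect(word):
    """Stage 1: treat a word as an error if it never occurs in the corpus."""
    return word not in FREQ


def suggest(word):
    """Stage 2: collect corpus words reachable by a single edit
    (deletion, transposition, substitution, or insertion)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    edits = deletes | transposes | replaces | inserts
    return {w for w in edits if w in FREQ}


def rank(candidates):
    """Stage 3: order suggestions by corpus frequency, most frequent first;
    ties are broken alphabetically so the output is deterministic."""
    return sorted(candidates, key=lambda w: (-FREQ[w], w))


def check(word):
    """Run the three stages in sequence; return [] for words found in the corpus."""
    if not detect(word):
        return []
    return rank(suggest(word))


print(check("cta"))
print(check("hat"))
```

Because the corpus supplies both the word list and the frequencies, the same resource serves detection and ranking, which is one way a corpus can stand in for a dictionary.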
