Improving a lexicon-based spelling checker for Sesotho sa Leboa

The aim of this article is to investigate how (i) n-gram analysis and (ii) the application of grammatical rules can improve the lexical recall of the spelling checker for Sesotho sa Leboa developed by the Centre for Text Technology, North-West University, in cooperation with the Department of African Languages at the University of Pretoria. It will be shown that, for a disjunctively written language like Sesotho sa Leboa, lexical recall exceeding 95% can be obtained by using a list of frequently occurring words. The paper first investigates the efficiency of using grapheme-based n-gram models in the spellchecking procedure. Second, it discusses the use of grammatical rules to increase lexical recall, focusing on nominal constructions such as the diminutive, locative and augmentative, and on verbal suffixes and suffix combinations.
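As a rough illustration of the two ideas named above, the sketch below builds a set of grapheme trigrams from a word list, flags a word as implausible when it contains a trigram never seen in that list, and computes lexical recall as the share of tokens found in the lexicon. Everything here is an assumption made for illustration: the function names, the `#` padding symbol, the toy word list, and the working definition of lexical recall are not taken from the article, and the authors' actual procedure may differ.

```python
from typing import Iterable, List, Set


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return the grapheme n-grams of a word, padded with '#' at both ends."""
    padded = "#" * (n - 1) + word.lower() + "#" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_ngram_set(lexicon: Iterable[str], n: int = 3) -> Set[str]:
    """Collect every n-gram that occurs in the (assumed correct) lexicon."""
    seen: Set[str] = set()
    for word in lexicon:
        seen.update(char_ngrams(word, n))
    return seen


def is_plausible(word: str, known: Set[str], n: int = 3) -> bool:
    """A word is plausible if all of its n-grams were seen in the lexicon."""
    return all(gram in known for gram in char_ngrams(word, n))


def lexical_recall(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Share of tokens accepted by a plain lexicon lookup (assumed definition)."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    accepted = sum(1 for token in tokens if token.lower() in lexicon)
    return accepted / len(tokens)


# Toy word list for illustration only; not the authors' lexicon.
toy_lexicon = {"motho", "batho", "go", "le", "ka"}
known_trigrams = build_ngram_set(toy_lexicon)

print(is_plausible("motho", known_trigrams))   # every trigram was seen
print(is_plausible("mxtho", known_trigrams))   # '#mx' was never seen
print(lexical_recall(["batho", "ba", "go"], toy_lexicon))
```

A trigram check of this kind can only reject letter sequences absent from the word list; it cannot confirm that a plausible-looking string is a real word, which is one reason to pair it with a lexicon lookup and, as the abstract proposes, with grammatical rules.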