A novel string distance metric for ranking Persian respelling suggestions

Spelling errors in digital documents are often caused by operational and cognitive mistakes, or by incomplete knowledge of the language in which the document is written. Computer-assisted solutions can help detect such errors and suggest replacements. In this paper, we present a new string distance metric for the Persian language that ranks respelling suggestions for a misspelled Persian word. The metric accounts for the effect of keyboard layout on typographical errors, for the homomorphic and homophonic aspects of words in orthographical misspellings, and for misspellings caused by disregarded diacritics. Because the proposed metric is custom-designed for Persian, we first describe the relevant spelling aspects of the language, such as homomorphs, homophones, and diacritics. We then present a statistical analysis of a set of large Persian corpora that identifies the causes and types of Persian spelling errors. We show that the proposed metric achieves a higher mean average precision and a higher mean reciprocal rank in ranking respelling candidates for Persian misspellings than other metrics such as the Hamming, Levenshtein, Damerau–Levenshtein, Wagner–Fischer, and Jaro–Winkler metrics.
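To make the evaluation setup concrete, the following is a minimal sketch of how respelling candidates can be ranked by a string distance and scored with mean reciprocal rank. It uses the plain Levenshtein distance, one of the baselines named above, and not the proposed Persian metric. The function names, the alphabetical tie-breaking rule, and the English toy words are all assumptions made for illustration; none of them come from the paper.

```python
from typing import Callable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_candidates(misspelling: str,
                    candidates: List[str],
                    distance: Callable[[str, str], int] = levenshtein) -> List[str]:
    """Sort candidates by distance; ties are broken alphabetically (an assumption)."""
    return sorted(candidates, key=lambda c: (distance(misspelling, c), c))


def reciprocal_rank(ranked: List[str], correct: str) -> float:
    """1 / position of the correct word, or 0.0 if it is absent."""
    for pos, cand in enumerate(ranked, start=1):
        if cand == correct:
            return 1.0 / pos
    return 0.0


def mean_reciprocal_rank(cases: List[Tuple[str, List[str], str]],
                         distance: Callable[[str, str], int] = levenshtein) -> float:
    """Average reciprocal rank over (misspelling, candidates, correct) triples."""
    total = sum(reciprocal_rank(rank_candidates(m, cands, distance), correct)
                for m, cands, correct in cases)
    return total / len(cases)


# Hypothetical toy data, not drawn from the paper's corpora.
toy_cases = [
    ("speling", ["spelling", "spewing", "sapling"], "spelling"),
    ("teh", ["ten", "the", "tea"], "the"),
]
mrr = mean_reciprocal_rank(toy_cases)
```

In the second toy case the intended word "the" is ranked last, because unit-cost Levenshtein charges two edits for a transposition; a metric that models error causes more closely would be expected to rank it higher. When each misspelling has exactly one correct respelling, average precision for that query equals its reciprocal rank, so the two measures coincide in this simplified setting.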
