"UTTAM": An Efficient Spelling Correction System for Hindi Language Based on Supervised Learning

In this article, we propose a system called “UTTAM” for correcting spelling errors in Hindi text using supervised learning. Unlike many other languages, Hindi has a large character set, inflected words, complex characters, and sets of phonetically similar characters. This complexity increases the possibility of confusion and occasionally leads a writer to enter a wrong character in a word. Spelling errors in text significantly reduce the accuracy of resources such as search engines and text editors. The proposed work is the first approach to correct non-word (out-of-vocabulary) errors and real-word errors simultaneously in a Hindi sentence. The proposed method investigates human behavior, i.e., the types and frequencies of spelling errors that people make in Hindi text. Based on these types and frequencies, heterogeneous data is collected in matrices, which are then used to generate suitable candidate words for each input word. The Viterbi algorithm is then applied to find the best sequence of candidate words and thereby correct the input sentence. For Hindi, this work is the first attempt at real-word error correction. For non-word errors, experiments show that “UTTAM” performs better than the existing systems SpellGuru and Saksham.
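The final step described above, choosing the best sequence of candidate words with the Viterbi algorithm, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `viterbi_correct`, the bigram language model, the probability floor, and all probability values are assumptions made for illustration, and the toy sentence uses English placeholder tokens rather than Hindi data. The `emission` table stands in for scores that a system like this might derive from its error matrices.

```python
import math

def viterbi_correct(observed, candidates, emission, bigram, floor=1e-6):
    """Return the most probable sequence of candidate words (illustrative sketch).

    observed   : list of input tokens
    candidates : dict mapping a token to its list of candidate words
    emission   : dict (candidate, token) -> P(token | candidate)   [assumed table]
    bigram     : dict (prev, word) -> P(word | prev); "<s>" marks the start
    floor      : probability used for pairs missing from a table   [assumption]
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[w] = (log-probability of the best path ending in w, that path)
    best = {"<s>": (0.0, [])}
    for token in observed:
        step = {}
        # A token with no candidate list is kept unchanged.
        for cand in candidates.get(token, [token]):
            emit = logp(emission, (cand, token))
            score, path = max(
                ((s + logp(bigram, (prev, cand)) + emit, p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[cand] = (score, path + [cand])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

# Toy data, invented for this example: "tree" is a real word used in error.
observed = ["i", "saw", "tree", "trees"]
candidates = {"i": ["i"], "saw": ["saw"],
              "tree": ["tree", "three"], "trees": ["trees"]}
emission = {("i", "i"): 0.99, ("saw", "saw"): 0.99,
            ("tree", "tree"): 0.9, ("three", "tree"): 0.1,
            ("trees", "trees"): 0.99}
bigram = {("<s>", "i"): 0.5, ("i", "saw"): 0.2,
          ("saw", "tree"): 0.01, ("saw", "three"): 0.05,
          ("tree", "trees"): 0.0001, ("three", "trees"): 0.1}

corrected = viterbi_correct(observed, candidates, emission, bigram)
print(corrected)  # ['i', 'saw', 'three', 'trees']
```

Because the input word itself stays in its own candidate list, the same search can handle a real-word error (a valid word that does not fit the context) and a non-word error (where the input would simply receive a low score) in one pass over the sentence.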
