Discriminative reranking for context-sensitive spell-checker

A large number of documents are generated every day. These documents may contain spelling errors that should be detected and corrected with a proofreading tool. Automatic writing-assistance tools such as spell-checkers and correctors can therefore help improve document quality. Spelling errors can be grouped into five categories. One of them is real-word errors, in which a misspelling produces another valid word of the language. Detecting such errors requires discourse analysis rather than a simple dictionary lookup. We propose a discourse-aware discriminative model that improves the results of context-sensitive spell-checkers by reranking their n-best lists. We integrate the proposed reranker into two existing context-sensitive spell-checkers: one based on statistical machine translation and the other based on a language model. We use the keywords of the whole document as contextual features of the model and improve the results of both systems by employing these features in a log-linear reranker. We evaluated the system on two languages, English and Persian. In the English experiments on the Wall Street Journal test set, detection recall and correction recall improve by 4.5% and 5.2%, respectively, over the baseline method. These recall gains were achieved with comparable precision. We also achieve state-of-the-art performance on Persian.
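
The reranking idea described above can be illustrated with a minimal sketch. It assumes each n-best candidate carries a score from the base spell-checker and that some keyword-association table is available; the names `features`, `rerank`, `cooc`, the two-feature layout, and the weights are all hypothetical and are not taken from the paper, which does not specify them here.

```python
import math

def features(candidate, keywords, cooc):
    """Feature vector for one candidate correction.

    f[0]: score assigned by the base spell-checker (assumed to be a log-score).
    f[1]: summed association between the candidate word and the document
          keywords, read from a hypothetical co-occurrence table `cooc`.
    """
    assoc = sum(cooc.get((candidate["word"], k), 0.0) for k in keywords)
    return [candidate["base_score"], assoc]

def rerank(nbest, keywords, cooc, weights):
    """Log-linear reranking: score = w . f, normalised with a softmax."""
    scores = []
    for cand in nbest:
        f = features(cand, keywords, cooc)
        scores.append(sum(w * x for w, x in zip(weights, f)))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    ranked = sorted(zip(nbest, exps), key=lambda pair: -pair[1])
    return [(cand["word"], e / z) for cand, e in ranked]

# Toy example (all numbers invented): "a peace of cake" in a document
# whose keywords are about baking.
nbest = [
    {"word": "peace", "base_score": -1.0},
    {"word": "piece", "base_score": -1.2},
]
keywords = ["bakery", "dessert"]
cooc = {("piece", "bakery"): 0.5, ("piece", "dessert"): 0.8}
weights = [1.0, 1.0]  # in practice these would be learned discriminatively

result = rerank(nbest, keywords, cooc, weights)
print(result[0][0])
```

In this toy setup the base system prefers "peace", but the document-level keyword feature moves "piece" to the top of the list. How the keywords are extracted and how the weights are trained are left open in this sketch.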
