Grammatical and context‐sensitive error correction using a statistical machine translation framework

Producing documents electronically rather than on paper has considerable benefits, such as easier organization and data management. Automatic writing-assistance tools such as spelling and grammar checkers can further improve the quality of electronic texts by removing noise and correcting erroneous sentences. Errors in a text can be categorized as spelling, grammatical, or real-word errors. In this article, we present a language-independent approach, based on a statistical machine translation framework, for developing a proofreading tool that detects grammatical errors as well as context-sensitive spelling mistakes (real-word errors). We also propose a hybrid model for grammar checking that combines this approach with an existing rule-based grammar checker. Experimental results on both English and Persian indicate that the proposed statistical method and the rule-based grammar checker are complementary in detecting and correcting syntactic errors. Applied to English texts, the hybrid grammar checker improves recall by about 24% while keeping precision nearly unchanged. Experiments on a real-world data set show that the approach achieves state-of-the-art results in grammar checking and context-sensitive spell checking for Persian. Copyright © 2012 John Wiley & Sons, Ltd.
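To make the idea of treating correction as translation more concrete, the following is a minimal sketch of the noisy-channel scoring that underlies statistical machine translation, applied to real-word errors. It is not the system described in the article: the toy corpus, the `CONFUSION` sets, the add-one-smoothed bigram language model, the `p_keep` channel parameter, and the restriction to one substitution per sentence are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy corpus of correct text (assumption; a real system trains on far more data).
CORPUS = [
    "i want a piece of cake",
    "we hope for peace in the world",
    "she ate a piece of bread",
    "they signed a peace treaty",
    "he cut a piece of paper",
]

# Hypothetical confusion sets: real words that are easily typed for one another.
CONFUSION = {"piece": {"peace"}, "peace": {"piece"}}


def train_bigrams(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def lm_logprob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def correct(sentence, unigrams, bigrams, p_keep=0.8):
    """Return the candidate maximizing log P(candidate) + log P(observed | candidate).

    Candidates are the observed sentence plus every variant obtained by
    replacing one word with a member of its confusion set.
    """
    observed = sentence.split()
    candidates = [observed]
    for i, word in enumerate(observed):
        for alt in sorted(CONFUSION.get(word, ())):
            candidates.append(observed[:i] + [alt] + observed[i + 1:])

    def score(candidate):
        # Channel model: a word is kept with probability p_keep; otherwise the
        # remaining mass is spread uniformly over its confusion set.
        channel = 0.0
        for obs_word, cand_word in zip(observed, candidate):
            if obs_word == cand_word:
                channel += math.log(p_keep)
            else:
                channel += math.log((1 - p_keep) / len(CONFUSION[obs_word]))
        return lm_logprob(candidate, unigrams, bigrams) + channel

    return " ".join(max(candidates, key=score))


unigrams, bigrams = train_bigrams(CORPUS)
print(correct("i want a peace of cake", unigrams, bigrams))
print(correct("they signed a peace treaty", unigrams, bigrams))
```

On this toy data the first sentence is rewritten with "piece", while the second, already correct, is left alone. A full phrase-based system would replace the hand-written confusion sets with a translation model learned from pairs of erroneous and corrected sentences.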
