Combining Confidence Score and Mal-rule Filters for Automatic Creation of Bangla Error Corpus: Grammar Checker Perspective

This paper describes a novel approach to the automatic creation of a Bangla error corpus for training and evaluating grammar checker systems. The procedure begins by automatically generating a large number of erroneous sentences from a set of grammatically correct sentences. A statistical Confidence Score Filter then selects suitable samples from the generated sentences: sentences with less probable word sequences receive lower confidence scores, and vice versa. A rule-based Mal-rule Filter, supported by an HMM-based semi-supervised POS tagger, collects the sentences that have improper tag sequences. Combining the two filters makes the approach robust, so that no valid construction is selected into the synthetically generated error corpus. Although the present work focuses on the most frequent grammatical errors in written Bangla, a detailed taxonomy of grammatical errors in Bangla is also presented, with the aim of increasing the coverage of the error corpus in future. The proposed approach is language independent and could easily be applied to create similar corpora in other languages.
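The pipeline outlined in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the toy English seed sentences, the two error operations (adjacent swap and word deletion), the add-one smoothed bigram model used as the confidence score, the dictionary lookup standing in for the HMM-based POS tagger, the forbidden tag bigrams used as mal-rules, and the rule that a candidate is kept only when both filters flag it.

```python
import math
from collections import Counter

# Toy seed corpus (hypothetical; the paper works on Bangla sentences).
SEEDS = [
    "the boy reads a book",
    "the girl reads a letter",
    "a boy writes the letter",
]

# Stand-in for the POS tagger: a plain lookup table (assumption).
TOY_TAGS = {
    "the": "DET", "a": "DET",
    "boy": "NOUN", "girl": "NOUN", "book": "NOUN", "letter": "NOUN",
    "reads": "VERB", "writes": "VERB",
}

# Hypothetical mal-rules: tag bigrams treated as improper sequences.
MAL_RULES = {("DET", "VERB"), ("DET", "DET"), ("DET", "</s>"), ("<s>", "VERB")}


def train_bigram(sentences):
    """Count context unigrams and bigrams with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])            # contexts only
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams, len(vocab)


def confidence(sentence, model):
    """Average add-one smoothed bigram log-probability of the sentence."""
    unigrams, bigrams, v = model
    toks = ["<s>"] + sentence.split() + ["</s>"]
    pairs = list(zip(toks, toks[1:]))
    logp = sum(math.log((bigrams[p] + 1) / (unigrams[p[0]] + v)) for p in pairs)
    return logp / len(pairs)


def generate_errors(sentence):
    """Candidate errors: swap adjacent words or delete one word."""
    toks = sentence.split()
    out = set()
    for i in range(len(toks) - 1):
        out.add(" ".join(toks[:i] + [toks[i + 1], toks[i]] + toks[i + 2:]))
    for i in range(len(toks)):
        out.add(" ".join(toks[:i] + toks[i + 1:]))
    out.discard(sentence)
    out.discard("")
    return sorted(out)


def violates_mal_rule(sentence):
    """True if the tag sequence contains any forbidden tag bigram."""
    tags = ["<s>"] + [TOY_TAGS.get(w, "X") for w in sentence.split()] + ["</s>"]
    return any(pair in MAL_RULES for pair in zip(tags, tags[1:]))


def select_errors(seeds):
    """Keep candidates flagged by both filters (assumed combination rule)."""
    model = train_bigram(seeds)
    # Assumed threshold: the lowest confidence among the correct seeds.
    threshold = min(confidence(s, model) for s in seeds)
    selected = set()
    for s in seeds:
        for cand in generate_errors(s):
            if confidence(cand, model) < threshold and violates_mal_rule(cand):
                selected.add(cand)
    return selected
```

Calling `select_errors(SEEDS)` returns the set of generated sentences that both score below every seed sentence and contain a forbidden tag bigram. Requiring agreement between the two filters is one way to read the abstract's claim that no valid construction ends up in the error corpus; the paper itself may combine them differently.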
