A POS Tagging Model Adapted to Learner English

Very little work has addressed the adaptation of part-of-speech (POS) tagging to learner English, even though POS tagging is widely used in related tasks. In this paper, we explore how POS tagging can be adapted to learner English efficiently and effectively. Based on a discussion of the possible causes of POS tagging errors in learner English, we show that deep neural models are particularly suitable for this task. Drawing on previous findings and this discussion, we introduce the design of our model, which is based on bidirectional Long Short-Term Memory. In addition, we describe how to adapt it to a wide variety of native languages (potentially hundreds of them). In the evaluation, we show empirically that the model is effective for POS tagging in learner English, achieving an accuracy of 0.964, which significantly outperforms the state-of-the-art POS tagger. We further investigate the tagging results in detail, revealing which parts of the model design do or do not improve performance.
