L1-L2 Parallel Treebank of Learner Chinese: Overused and Underused Syntactic Structures

We present a preliminary analysis of a corpus of texts written by learners of Chinese as a foreign language (CFL), annotated as an L1-L2 parallel dependency treebank. The treebank consists of parse trees of sentences written by CFL learners (“L2 sentences”), parse trees of their target hypotheses (“L1 sentences”), and word alignments between the L1 and L2 sentences. The treebank currently contains 600 L2 sentences and 697 L1 sentences. We report the syntactic relations that the CFL learners most overuse and underuse, and discuss the underlying learner errors.
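One way to picture the overuse/underuse comparison is to count dependency relation labels on each side of the treebank and compare their relative frequencies. The sketch below is illustrative only and is not the authors' procedure: the smoothed log-ratio measure, the function names `relation_counts` and `usage_ratios`, the `smoothing` parameter, and the toy sentences (placeholder tokens `w1` to `w4`) are all assumptions introduced here, not data or methods from the paper.

```python
from collections import Counter
import math


def relation_counts(sentences):
    """Count dependency relation labels over parsed sentences.

    Each sentence is a list of (token, head_index, relation) triples.
    """
    return Counter(rel for sent in sentences for _, _, rel in sent)


def usage_ratios(l2_counts, l1_counts, smoothing=0.5):
    """Return log2(relative freq in L2 / relative freq in L1) per relation.

    Positive values suggest overuse in the learner (L2) sentences,
    negative values suggest underuse. Additive smoothing (an assumption
    of this sketch) keeps relations seen on only one side finite.
    """
    labels = set(l2_counts) | set(l1_counts)
    l2_total = sum(l2_counts.values()) + smoothing * len(labels)
    l1_total = sum(l1_counts.values()) + smoothing * len(labels)
    ratios = {}
    for rel in labels:
        p_l2 = (l2_counts[rel] + smoothing) / l2_total
        p_l1 = (l1_counts[rel] + smoothing) / l1_total
        ratios[rel] = math.log2(p_l2 / p_l1)
    return ratios


# Invented toy data with placeholder tokens; not taken from the treebank.
l2_sentences = [[("w1", 2, "nsubj"), ("w2", 0, "root"),
                 ("w3", 2, "advmod"), ("w4", 2, "advmod")]]
l1_sentences = [[("w1", 2, "nsubj"), ("w2", 0, "root"),
                 ("w3", 2, "advmod"), ("w4", 2, "dobj")]]

ratios = usage_ratios(relation_counts(l2_sentences),
                      relation_counts(l1_sentences))

# List relations from most overused to most underused in the toy data.
for rel, score in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{rel}\t{score:+.2f}")
```

On real data, a significance-aware statistic would likely be preferable to a raw ratio, since rare relations can produce large ratios from very few observations.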
