The Jinan Chinese Learner Corpus

We present the Jinan Chinese Learner Corpus, a large collection of texts written by learners of Chinese as a second language (L2) that can be used for educational tasks. This paper introduces the data and describes it in detail. The corpus currently contains approximately 6 million Chinese characters written by students from over 50 different first-language (L1) backgrounds. It is freely available to researchers, either through a web interface or as a set of raw texts. The data can be used in NLP tasks including automatic essay grading, language transfer analysis, and error detection and correction. It can also be used in applied and corpus linguistics to support Second Language Acquisition (SLA) research and the development of pedagogical resources. We also discuss practical applications of the data and future directions.
