Exploiting Parse Structures for Native Language Identification

Profiling authors by characteristics inferred from their text, including native language, has drawn attention in recent years, mostly through machine learning approaches that rely on lexical features. Drawing on the idea of contrastive analysis, which postulates that syntactic errors in a text are to some extent influenced by the author's native language, this paper explores the usefulness of syntactic features for native language identification. We take two types of parse substructure as features (horizontal slices of trees, and the more general feature schemas from discriminative parse reranking) and show that syntactic features of this kind yield a classification accuracy of around 80% across seven native languages, an error reduction of more than 30%.
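
To make the first feature type concrete, the sketch below treats a "horizontal slice" of a parse tree as a single parent node together with its immediate children, i.e. a context-free production rule. That reading is an assumption, not something the abstract states, and the helper names (`parse_bracketed`, `production_counts`) and the example sentence are hypothetical. The code reads a Penn-Treebank-style bracketed parse and counts non-lexical productions, which could then serve as features for a classifier.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def production_counts(tree, counts=None):
    """Count non-lexical productions (parent label -> child labels)."""
    if counts is None:
        counts = Counter()
    label, children = tree
    child_nodes = [c for c in children if isinstance(c, tuple)]
    if child_nodes:  # skip preterminals such as (DT the)
        rule = label + " -> " + " ".join(c[0] for c in child_nodes)
        counts[rule] += 1
        for child in child_nodes:
            production_counts(child, counts)
    return counts


# Hypothetical learner sentence with a subject-verb agreement slip.
example = "(S (NP (DT the) (NN student)) (VP (VBZ write) (NP (NN essay))))"
features = production_counts(parse_bracketed(example))
```

Here `features` maps each rule string (for example `S -> NP VP`) to its count in the sentence. Summing such counters over all sentences of a document would give one sparse feature vector per author text. Lexical rules are left out in this sketch so that the features stay syntactic rather than topical.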
