Topic Modeling for Native Language Identification

Native language identification (NLI) is the task of determining an author's native language from text written in a second language. Earlier work has found that features such as function words, part-of-speech n-grams and syntactic structure are helpful in NLI, perhaps because they capture errors characteristic of speakers of different native languages. This paper uses Latent Dirichlet Allocation (LDA) as a feature clustering technique over lexical features to see whether these smaller-scale features cluster into more coherent latent factors, and investigates the effect of the resulting clusters in a classification task. We find that although classification accuracy decreases, as expected, there is some evidence of coherent clustering, which could help with much larger syntactic feature spaces.
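As a rough illustration of what using LDA as a feature clustering step can look like, the sketch below fits a small collapsed Gibbs sampler to documents represented as lists of function words and returns each document's topic proportions, which could then stand in for the raw lexical counts as classifier input. It is not the paper's implementation: the sampler choice, the hyperparameters `alpha` and `beta`, the function name `lda_gibbs` and the toy documents are all assumptions made for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of token lists. Returns one topic-proportion vector per document.
    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # total tokens in topic k
    assign = []                                           # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions: the reduced feature vectors.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return theta


# Hypothetical toy documents, each reduced to its function words.
docs = [
    ["the", "of", "the", "in", "of", "the"],
    ["a", "to", "a", "for", "to", "a"],
    ["the", "in", "of", "the", "of"],
    ["to", "a", "for", "a", "to"],
]
theta = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` has one entry per topic rather than one per word, so the feature space shrinks from the vocabulary size to `num_topics`. That kind of reduction is consistent with the loss in classification accuracy the abstract reports, and it is what makes the approach attractive for much larger syntactic feature spaces.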
