Automatically Extracting Typical Syntactic Differences from Corpora

We develop an aggregate measure of syntactic difference for automatically finding common syntactic differences between collections of text. With this measure, it is possible to mine for differences between, for example, the English of learners and that of native speakers, or between related dialects. Hypotheses formulated in advance can also be tested for statistical significance. The measure detects not only the absence or presence of specific constructs, but also their under- and overuse. We applied the measure to the English of Finnish immigrants in Australia to look for traces of Finnish grammar in their English. We analysed the outcomes of this detection process and found them insightful; a report is included in this article. Besides explaining our method, we discuss the theory behind it, including permutation statistics and the custom normalizations required to apply these tests to syntactic data. We also explain how to use the software we developed to apply the method to new corpora, and we give some suggestions for further research.
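
To make the idea of a permutation test on syntactic data concrete, here is a minimal sketch. It is not the authors' software or their exact measure. It assumes that each sentence is already given as a list of part-of-speech tags, that the features are tag trigrams, and that the aggregate difference is the sum of absolute differences in relative trigram frequency. All function names are hypothetical, and the custom normalizations mentioned in the abstract are not reproduced.

```python
import random
from collections import Counter


def trigram_counts(sentences):
    """Count part-of-speech tag trigrams over a collection of tagged sentences."""
    counts = Counter()
    for tags in sentences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts


def difference(group_a, group_b):
    """Illustrative aggregate measure (an assumption, not the paper's):
    sum of absolute differences in relative trigram frequency."""
    ca, cb = trigram_counts(group_a), trigram_counts(group_b)
    na = sum(ca.values()) or 1  # avoid division by zero for empty groups
    nb = sum(cb.values()) or 1
    return sum(abs(ca[k] / na - cb[k] / nb) for k in set(ca) | set(cb))


def permutation_p_value(group_a, group_b, rounds=1000, seed=0):
    """Estimate how often a random reassignment of sentences to the two
    groups yields a difference at least as large as the observed one."""
    rng = random.Random(seed)
    observed = difference(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if difference(shuffled_a, shuffled_b) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)
```

A small p-value suggests that the two collections differ syntactically more than chance reassignment of sentences would explain. Shuffling whole sentences rather than individual trigrams is a design choice in this sketch: it keeps the trigrams of one sentence together, since they are not independent observations.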
