Measuring syntactical variation in Germanic texts

We present two new measures of syntactic distance between languages. First, we present the ‘movement measure’, which measures the average number of words that have moved in sentences of one language compared to the corresponding sentences in another language. Second, we introduce the ‘indel measure’, which measures the average number of words inserted or deleted in sentences of one language compared to the corresponding sentences in another language. We compared the two measures to the ‘trigram measure’ introduced by Nerbonne & Wiersma (2006, A Measure of Aggregate Syntactic Distance. In Nerbonne, J. and Hinrichs, E. (eds.) Linguistic Distances Workshop at the joint conference of International Committee on Computational Linguistics and the Association for Computational Linguistics, Sydney, July, 2006, pp. 82–90). We correlated the results of the three measures and found a low correlation between the results of the movement and indel measures, indicating that the two measures represent different kinds of linguistic variation. We found a high correlation between the results of the movement measure and the trigram measure. The results of all three measures suggest that English is syntactically a Scandinavian language. Because of our unique database design, we were able to detect asymmetric relationships between the languages. All three measures suggest that asymmetric syntactical distances could be part of the explanation why native speakers of Dutch understand German texts more easily than native speakers of German understand Dutch texts (Swarte 2016).
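As a rough illustration of the two definitions above, the following Python sketch counts inserted or deleted words and moved words for aligned sentence pairs, then averages over pairs. It is not the authors' implementation: the abstract does not say how words are aligned, so the sketch assumes that corresponding words in the two sentences already share an identifier (here an English gloss) and that each identifier occurs at most once per sentence. The function names `indel_count`, `movement_count`, and `average` are hypothetical, and the sketch is symmetric per sentence pair, so it does not reproduce the asymmetric relationships mentioned in the abstract.

```python
from bisect import bisect_left


def indel_count(source, target):
    """Number of word identifiers that occur in only one of the two sentences.

    Assumption: aligned words share an identifier, so anything unmatched
    counts as an insertion or a deletion.
    """
    return len(set(source) ^ set(target))


def movement_count(source, target):
    """Number of shared words that are out of order relative to the target.

    Assumption: a word counts as 'moved' when it falls outside the longest
    run of shared words that keeps the same relative order in both sentences.
    """
    position = {word: i for i, word in enumerate(target)}
    # Target positions of the shared words, listed in source order.
    ranks = [position[w] for w in source if w in position]
    # Length of the longest increasing subsequence (patience sorting).
    tails = []
    for r in ranks:
        k = bisect_left(tails, r)
        if k == len(tails):
            tails.append(r)
        else:
            tails[k] = r
    return len(ranks) - len(tails)


def average(measure, sentence_pairs):
    """Mean of a per-sentence measure over all aligned sentence pairs."""
    return sum(measure(s, t) for s, t in sentence_pairs) / len(sentence_pairs)


# Toy data (invented for illustration): identifiers are English glosses.
pairs = [
    (["I", "have", "the", "book", "read"], ["I", "have", "read", "the", "book"]),
    (["he", "come", "not"], ["he", "do", "not", "come"]),
]
avg_movement = average(movement_count, pairs)
avg_indel = average(indel_count, pairs)
```

In the first toy pair only the verb changes position and no word is unmatched; in the second, one word has no counterpart and one shared word is out of order. Averaging such counts over a parallel corpus gives one distance value per language pair under the stated assumptions.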

[1] Renée van Bezooijen, et al. Phonetics in Europe: Perception and Production, 2013.

[2] Nathan C. Sanders. Measuring Syntactic Difference in British English, 2007, ACL.

[3] W. Heeringa, et al. Modeling Intelligibility of Written Germanic Languages: Do We Need to Distinguish Between Orthographic Stem and Affix Variation?, 2014.

[4] L. Cronbach. Coefficient alpha and the internal structure of tests, 1951.

[6] M. Spruit. Quantitative perspectives on syntactic variation in Dutch dialects, 2008.

[7] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[8] W. Torgerson. Multidimensional scaling: I. Theory and method, 1952.

[9] John Nerbonne, et al. Automatically Extracting Typical Syntactic Differences from Corpora, 2011, Lit. Linguistic Comput.

[10] T. R. Lounsbury. History of the English Language, 2007.

[11] Jack Grieve, et al. Regional Variation in Written American English, 2016.

[12] John Nerbonne, et al. Detecting Syntactic Contamination in Emigrants: The English of Finnish Australians, 2007.

[13] Jack Grieve, et al. A corpus-based regional dialect survey of grammatical variation in written standard American English, 2009.

[14] Graeme Hirst, et al. Bigrams of Syntactic Labels for Authorship Discrimination of Short Texts, 2007, Lit. Linguistic Comput.

[15] W. Wiersma, et al. Language Contact. New Perspectives, 2010.

[16] Etienne Barnard, et al. Orthographic measures of language distances between the official South African languages, 2008.

[17] W. Wiersma, et al. A Measure of Aggregate Syntactic Distance, 2006.

[18] E. Gelderen. Split Infinitives in Early Middle English, 2016.

[19] D. Lightfoot. English: The language of the Vikings by Joseph Embley Emonds and Jan Terje Faarlund (review), 2016.

[20] Elly van Gelderen, et al. A History of the English Language, 2000.

[21] Benedikt Szmrecsanyi, et al. Corpus-based Dialectometry: Aggregate Morphosyntactic Variability in British English Dialects, 2008, Int. J. Humanit. Arts Comput.

[22] Jelena Golubović. Mutual intelligibility in the Slavic language area, 2016.

[23] R. Sokal, et al. The Comparison of Dendrograms by Objective Methods, 1962.

[24] Francisca Swarte. Predicting the mutual intelligibility of Germanic languages from linguistic and extra-linguistic factors, 2016.

[25] B. Kortmann. The Viking Hypothesis from a Dialectologist’s Perspective, 2016.

[26] Charlotte Gooskens, et al. How easy is it for speakers of Dutch to understand spoken and written Frisian and Afrikaans, and why?, 2005.

[27] Erhard W. Hinrichs, et al. Linguistic Distances Workshop at the joint conference of International Committee on Computational Linguistics and the Association for Computational Linguistics, 2006.

[28] P. Trudgill. Norsified English or Anglicized Norse, 2016.

[29] David Sankoff, et al. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, 1983.

[30] Shane S. Sturrock, et al. Time Warps, String Edits, and Macromolecules – The Theory and Practice of Sequence Comparison. David Sankoff and Joseph Kruskal. ISBN 1-57586-217-4. Price £13.95 (US$22·95), 2000.

[31] Bas Aarts, et al. Exploring Natural Language: Working with the British Component of the International Corpus of English, 2002.

[32] E. MacMurray, et al. Textometry and Information Discovery: A New Approach to Mining Textual Data on the Web, 2011.

[33] John W. Sammon, et al. A Nonlinear Mapping for Data Structure Analysis, 1969, IEEE Transactions on Computers.

[34] John Nerbonne, et al. Applying Language Technology to Detect Shift Effects, 2010.

[35] Anne Brontë. Tenant of Wildfell Hall, 1848.

[36] Anil K. Jain, et al. Algorithms for Clustering Data, 1988.

[37] Markku Filppula. External Influences on English: From its Beginnings to the Renaissance, 2014.

[38] Boris Katz, et al. Using Syntactic Information to Identify Plagiarism, 2005.

[39] N. Mantel. The detection of disease clustering and a generalized regression approach, 1967, Cancer research.

[40] John Nerbonne, et al. Detecting Syntactic Contamination in Emigrants: The English of Finnish Emigrants, 2007.

[41] Geoffrey Sampson, et al. A proposal for improving the measurement of parse accuracy, 2000.

[42] Wilbert Jan Heeringa. Measuring dialect pronunciation differences using Levenshtein distance, 2004.