EXPLOITING UNIVERSAL DEPENDENCIES TREEBANKS FOR MEASURING MORPHOSYNTACTIC COMPLEXITY

There has been recent interest in quantifying linguistic complexity (Juola, 1998; Dahl, 2004; Newmeyer & Preston, 2014; Bentz, Alikaniotis, Cysouw, & Ferrer-i-Cancho, 2017; Koplenig, Meyer, Wolfer, & Mueller-Spitzer, 2017; Stump, 2017). Besides its theoretical interest, quantifying the complexity of languages or subsystems of languages is also important for first and second language acquisition research. In this paper, we present a number of morphosyntactic measures, some proposed in earlier literature, and some novel to the best of our knowledge.

The Measuring Linguistic Complexity (MLC) shared task aims to bring together different measures of linguistic complexity, encouraging the use of Universal Dependencies (UD) treebanks (Nivre et al., 2016). The UD project defines a unified tagset, and the UD treebanks already cover a large number of languages. The multilingual focus of the UD project requires paying attention to linguistic typology (Croft, Nordquist, Looney, & Regan, 2017), and the treebanks, in return, constitute a promising resource for typological (and, more generally, multilingual) research. Not surprisingly, the MLC shared task offers a subset of the UD treebanks as the data set for measuring the complexity of (subsystems of) languages. We present a number of quantitative measures of morphosyntactic complexity, namely, type/token ratio (TTR; e.g., Kettunen, 2014); mean size of paradigm (MSP; Xanthos et al., 2011); entropy of morphological-feature distribution; entropy of morphological-feature distribution conditioned on the word
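As a rough illustration of the first three measures named above, the following Python sketch computes them from CoNLL-U formatted treebank lines. It is a minimal sketch under stated assumptions, not the procedure used in the paper: TTR is taken as distinct word forms over tokens, MSP as distinct word forms over distinct lemmas, and the feature entropy as the Shannon entropy of individual `Feature=Value` pairs pooled over all words. The function names (`read_conllu`, `ttr`, `msp`, `feature_entropy`) are hypothetical, and details such as lowercasing or sampling fixed-size subsets of the treebank are not taken from the source.

```python
import math
from collections import Counter


def read_conllu(lines):
    """Return a list of (form, lemma, feats) tuples from CoNLL-U lines."""
    words = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        # Keep only ordinary word lines; skip multiword-token ranges
        # (IDs like "1-2") and empty nodes (IDs like "1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        # Lowercasing forms and lemmas is an assumption of this sketch.
        words.append((cols[1].lower(), cols[2].lower(), cols[5]))
    return words


def ttr(words):
    """Type/token ratio: distinct word forms divided by number of tokens."""
    forms = [form for form, _, _ in words]
    return len(set(forms)) / len(forms)


def msp(words):
    """Mean size of paradigm: distinct word forms per distinct lemma."""
    forms = {form for form, _, _ in words}
    lemmas = {lemma for _, lemma, _ in words}
    return len(forms) / len(lemmas)


def feature_entropy(words):
    """Shannon entropy (bits) of the pooled Feature=Value distribution."""
    counts = Counter()
    for _, _, feats in words:
        if feats != "_":  # "_" marks a word without morphological features
            counts.update(feats.split("|"))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

Given the lines of a treebank file, `words = read_conllu(open(path, encoding="utf-8"))` followed by `ttr(words)`, `msp(words)` and `feature_entropy(words)` yields the three values, where `path` stands for any CoNLL-U file. Because TTR in particular depends strongly on text size, comparing languages this way presumably requires samples of equal size, which this sketch does not enforce.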

[1] Michael Cysouw, et al. The Entropy of Words - Learnability and Expressivity across More than 1000 Languages, 2017, Entropy.

[2] Taraka Rama, et al. A Telugu treebank based on a grammar book, 2018, TLT.

[3] Yuri M. Suhov, et al. Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text, 1998, IEEE Trans. Inf. Theory.

[4] Robert Malouf, et al. Morphological Organization: The Low Conditional Entropy Conjecture, 2013.

[5] Frederick J. Newmeyer, et al. Measuring Grammatical Complexity, 2014.

[6] Östen Dahl, et al. The growth and maintenance of linguistic complexity, 2004.

[7] Sampo Pyysalo, et al. Universal Dependencies v1: A Multilingual Treebank Collection, 2016, LREC.

[8] Çağrı Çöltekin, et al. A grammar-book treebank of Turkish, 2017.

[9] Carolin Müller-Spitzer, et al. The statistical trade-off between word order and word structure – Large-scale evidence for the principle of least effort, 2016, PLoS ONE.

[10] W. Dressler, et al. On the role of morphological richness in the early development of noun and verb inflection, 2011.

[11] Gregory T. Stump. The Nature and Dimensions of Complexity in Morphology, 2017.

[12] Kimmo Kettunen, et al. Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?, 2014, J. Quant. Linguistics.

[13] Patrick Juola. Measuring Linguistic Complexity: The Morphological Tier, 1998, J. Quant. Linguistics.

[14] William Croft, et al. Linguistic Typology meets Universal Dependencies, 2017, TLT.

[15] Yun Gao, et al. Estimating the Entropy of Binary Time Series: Methodology, Some Theory and a Simulation Study, 2008, Entropy.