Which annotation scheme is more expedient to measure syntactic difficulty and cognitive demand?

This paper investigates which dependency treebank annotation scheme is better suited to measuring the syntactic complexity and cognitive constraints of language materials. Two representatives of semantic- and syntactic-oriented annotation schemes, the Universal Dependencies (UD) and the Surface-Syntactic Universal Dependencies (SUD), are under discussion. The results show that, on the one hand, natural languages annotated under both schemes follow the universal linguistic law of Dependency Distance Minimization (DDM); on the other hand, according to the metric of Mean Dependency Distances (MDDs), the SUD annotation scheme, which accords with traditional dependency syntaxes, is more expedient for measuring syntactic difficulty and cognitive demand.

1 Background and Motivation

Dependency grammar deals with syntactically related words, i.e. the governor and the dependent, within sentence structure (Heringer, 1993; Hudson, 1995; Liu, 2009). It can be dated back to the seminal work Éléments de Syntaxe Structurale by Tesnière (1959), and has developed through different theories, including Word Grammar, Meaning-Text-Theory, Lexicase, etc. (e.g. Hudson, 1984; Mel’čuk, 1988; Starosta, 1988; Eroms, 2000). Thus far, there are many representations of dependency grammar. Figure 1 displays two typical dependency representations of one sample sentence, We walked along the lake.

Figure 1. Dependency Representations of One English Sentence We walked along the lake Based on UD and SUD Annotation Schemes.

The dependency representation based on the Universal Dependencies (UD), as shown in Figure 1 (a), is one of the most prominent models to date within the framework of dependency grammar (see also http://universaldependencies.org/). It attempts to establish a multilingual morphosyntactic scheme that annotates various languages in a consistent manner (Nivre, 2015; Osborne and Gerdes, 2019). Thus, the UD annotation scheme favours semantic over syntactic criteria, giving priority to content words in order to maximize “crosslinguistic parallelism” (Nivre, 2015; de Marneffe and Nivre, 2019).

By contrast, the Surface-Syntactic Universal Dependencies (SUD) annotation scheme, as shown in Figure 1 (b), follows syntactic criteria to define not only the dependency labels but also the dependency links. It aims to keep the annotation scheme close to dependency traditions such as Meaning-Text-Theory (MTT) (Mel’čuk, 1988), Word Grammar (Hudson, 1984), etc. Hence, the SUD annotation scheme is a syntactic-oriented dependency representation that seeks to promote syntactic motivations (Gerdes et al., 2018; Osborne and Gerdes, 2019). The UD and SUD annotation schemes therefore signify two typical preferences of dependency grammar: one is semantic-oriented, and the other is syntactic-oriented.

As shown in Figure 1, the linear sentence in both representations is divided into words, and the labelled arcs, directed from the governors to the dependents, represent different dependency types indicating the syntactic relations between elements within the sentence. Hence, the dependency representations indicate both the functional role of each word and the syntactic relations between different elements. More importantly, based on dependency representations, linguists have proposed several measurements for linguistic analysis. For one thing, dependency distance is defined as the linear distance between the governor and the dependent (Hudson, 1995). For another, the linear order of the governor and the dependent of each dependency type is referred to as dependency direction (Liu, 2010). When a governor appears before a dependent, the dependency direction is governor-initial or negative. Otherwise, it is governor-final or positive.
For instance, in Figure 1 (a), the arc between the dependent we and the governor walked forms a governor-final relation, and the dependency distance between these two elements is 2 – 1 = 1 (the numbers 2 and 1 in the subtraction represent the linear positions of the governor and the dependent, respectively). The detailed calculation method is given in Section 2. The dependency representations and the measures of dependency relations are therefore both explicit and clear-cut. This explains why dependency treebanks, i.e. corpora with annotations (Abeillé, 2003), are widespread among linguists in the big-data era. As a result, the variations and universals of human languages are explored and unveiled through statistical and mathematical tools (Hudson, 1995; Liu et al., 2017).

Notably, previous studies have shown that dependency distance is an important indicator of syntactic complexity and cognitive demand (Hudson, 1995; Gibson, 2000; Liu, 2008). Under the framework of dependency grammar, Hudson (1995) defined dependency distance on the basis of theories of memory decay and short-term memory (e.g. Brown, 1958; Levy et al., 2013). The notions of syntactic difficulty and cognitive demand have subsequently been related to the linear distance between governors and dependents in cognitive science (Gibson, 1998; Hawkins, 2004). Based on a Romanian dependency treebank, Ferrer-i-Cancho (2004) hypothesized and showed that the mean dependency distance of a sentence is minimized and constrained. These works paved the way for Liu’s (2008) empirical study on dependency distance, which provides a viable treebank-based approach to measuring syntactic complexity and cognitive constraint. Afterwards, a series of studies exploring the relationship between dependency distance and syntactic and cognitive benchmarks has been conducted (e.g. Jiang and Liu, 2015; Wang and Liu, 2017; Liu et al., 2017).
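To make the definitions of dependency distance and dependency direction concrete, the following Python sketch computes both for the sample sentence We walked along the lake. It is only an illustration: the `Token` class and the function names are hypothetical, and the head positions are an assumption reconstructed from the general UD and SUD conventions (in UD the preposition attaches to the noun; in SUD the preposition governs the noun), not copied from Figure 1. The sign convention follows the text above: governor position minus dependent position, so a positive value is governor-final.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    position: int  # 1-based linear position in the sentence
    form: str      # word form
    head: int      # position of the governor; 0 marks the root


def signed_distance(token: Token) -> int:
    """Governor position minus dependent position (e.g. 2 - 1 = 1 for 'We')."""
    return token.head - token.position


def direction(token: Token) -> str:
    """Positive distance: governor follows the dependent (governor-final)."""
    return "governor-final" if signed_distance(token) > 0 else "governor-initial"


def describe(tokens: list[Token]) -> list[tuple[str, str, int, str]]:
    """Return (dependent, governor, signed distance, direction) per dependency.

    Assumes tokens are listed in linear order; the root has no governor
    and is skipped.
    """
    rows = []
    for token in tokens:
        if token.head == 0:
            continue
        governor = tokens[token.head - 1]
        rows.append((token.form, governor.form, signed_distance(token), direction(token)))
    return rows


# Assumed head positions (hypothetical reconstruction, see lead-in).
UD_EXAMPLE = [
    Token(1, "We", 2), Token(2, "walked", 0), Token(3, "along", 5),
    Token(4, "the", 5), Token(5, "lake", 2),
]
SUD_EXAMPLE = [
    Token(1, "We", 2), Token(2, "walked", 0), Token(3, "along", 2),
    Token(4, "the", 5), Token(5, "lake", 3),
]

for name, tokens in (("UD", UD_EXAMPLE), ("SUD", SUD_EXAMPLE)):
    print(name)
    for dependent, governor, distance, label in describe(tokens):
        print(f"  {dependent} <- {governor}: {distance:+d} ({label})")
```

Under these assumed structures, only the dependencies involving along and lake differ between the two schemes; the relation between We and walked is governor-final with distance 1 in both.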
These studies share some similarities. First, it is well corroborated that the frequency of a dependency distance decreases as the distance increases, viz., the distribution of dependency distance follows the linguistic law of the Least Effort Principle (LEP) or Dependency Distance Minimization (DDM) (Zipf, 1965; Liu et al., 2017). Second, it is believed that the greater the dependency distance is, the more difficult the sentence structure is (Gibson, 1998; Hiranuma, 1999; Liu et al., 2017). Thus, the mean dependency distance (MDD), i.e. the arithmetic average of all dependency distances in a sentence or a treebank (Liu, 2008), has been an important index of memory burden, demonstrating the syntactic complexity and cognitive demand of the language concerned (Hudson, 1995; Liu et al., 2017).

Previous studies have shown that several factors affect the measurement of the dependency distance of a sentence, including sentence length, genre, chunking, language type, grammar, annotation scheme and so forth (e.g. Jiang and Liu, 2015; Wang and Liu, 2017; Lu et al., 2016; Hiranuma, 1999; Liu and Xu, 2012; Gildea and Temperley, 2010). Most of these factors have been well investigated; the factor of annotation scheme, however, has rarely been studied. Liu et al. (2009), for instance, investigated Chinese syntactic and typological properties based on five different Chinese treebanks with different genres and annotation schemes, yet the treebanks with different annotation schemes were used to avoid corpus influences and ensure a reliable conclusion, rather than to examine the schemes themselves. Hence, the question as to the effects of annotation scheme on the distribution of dependency distance and MDD remains open. (On SUD, see also https://gitlab.inria.fr/grew/SUD.)
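The two quantities discussed above, the frequency distribution of dependency distances and the MDD, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the calculation method of Section 2: it assumes that absolute distances are averaged and that the root is excluded, and the function names and the head lists for We walked along the lake are hypothetical (reconstructed from general UD and SUD conventions).

```python
from collections import Counter


def dependency_distances(heads: list[int]) -> list[int]:
    """Absolute dependency distances of one sentence.

    heads[i] is the 1-based governor position of word i + 1; 0 marks the
    root, which has no governor and therefore no distance.
    """
    return [abs(head - position)
            for position, head in enumerate(heads, start=1)
            if head != 0]


def mean_dependency_distance(sentences: list[list[int]]) -> float:
    """Arithmetic mean of all dependency distances in a sentence or treebank."""
    distances = [d for heads in sentences for d in dependency_distances(heads)]
    if not distances:
        raise ValueError("no dependencies to average")
    return sum(distances) / len(distances)


def distance_frequencies(sentences: list[list[int]]) -> Counter:
    """How often each dependency distance occurs."""
    return Counter(d for heads in sentences for d in dependency_distances(heads))


# Assumed head positions for "We walked along the lake" (hypothetical).
UD_HEADS = [2, 0, 5, 5, 2]
SUD_HEADS = [2, 0, 2, 5, 3]

for name, heads in (("UD", UD_HEADS), ("SUD", SUD_HEADS)):
    print(name,
          "MDD =", mean_dependency_distance([heads]),
          "frequencies =", dict(distance_frequencies([heads])))
```

A single sentence cannot show whether a distribution follows DDM; with a whole treebank passed as a list of sentences, the same functions would give the treebank-level MDD and the frequency counts whose decay with distance is at issue.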
Moreover, the investigations into benchmarks of syntactic complexity and cognitive demand introduced above were primarily based on traditional syntactic-oriented dependency models, for instance, the Stanford Typed Dependencies annotation scheme (de Marneffe and Manning, 2008) or other annotation schemes that were specifically designed for individual languages. Thus, there is no consistency among different treebanks. In addition, there are some qualitative investigations of the distinctions between the UD annotation scheme and various traditional syntactic-oriented annotation schemes (e.g. Osborne and Maxwell, 2015), and some empirical studies focusing primarily on the consistently annotated UD scheme (e.g. Chen and Gerdes, 2017; 2018). It nevertheless remains an open question whether linguistic analysis based on the UD annotation scheme, compared with analysis based on consistently annotated, traditional syntactic-oriented schemes, can still function as a metric of syntactic difficulty and cognitive demand, and, if so, what accounts for the distinctions between the two.

The lack of investigations into treebank annotation schemes therefore motivates the current study. We attempt to compare dependency distances based on two different annotation schemes, UD and SUD. To address the issues mentioned above, the following questions are discussed on the basis of UD and SUD treebanks:

(1) Will the probability distribution of dependency distances of natural texts change when they are based on different annotation schemes? Do they still follow the linguistic law of DDM?
(2) Based on MDDs, which annotation scheme is better suited to measuring syntactic complexity and cognitive demand?
(3) Which dependency types account most for the distinctions between the UD and SUD annotation schemes?
2 Materials and Methods

Taking English as an example, we adopt the Georgetown University Multilayer Corpus (GUM) (Zeldes, 2017) from the UD 2.2 and SUD 2.2 projects. Both versions of the treebank consist of seven genres, viz. academic writing, biographies, fiction, interviews, news stories, travel guides and how-to guides. Since the treebanks are balanced in terms of genre, they should better demonstrate the general features of the probability distribution of dependency distance under different annotation schemes. To measure the effectiveness of MDDs as a metric of syntactic dif

[1] Joakim Nivre, et al. Towards a Universal Grammar for Natural Language Processing, 2015, CICLing.

[2] Igor Mel’čuk, et al. Dependency Syntax: Theory and Practice, 1987.

[3] Stanley Starosta. The case for lexicase: an outline of lexicase grammatical theory, 1989.

[4] Wenwen Li, et al. Chinese Syntactic and Typological Properties Based on Dependency Syntactic Treebanks, 2009.

[5] 菅山 謙正, et al. Word Grammar 理論の研究, 2005.

[6] Timothy Osborne, et al. The Dependency Status of Function Words: Auxiliaries, 2015, DepLing.

[7] Amir Zeldes, et al. The GUM corpus: creating multilayer resources in the classroom, 2016, Language Resources and Evaluation.

[8] So Hiranuma, et al. Syntactic difficulty in English and Japanese: A textual study, 2022.

[9] John Brown. Some Tests of the Decay Theory of Immediate Memory, 1958.

[10] Ramon Ferrer i Cancho, et al. Euclidean distance between syntactically linked words, 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[11] N. Cowan. The magical number 4 in short-term memory: A reconsideration of mental storage capacity, 2001, Behavioral and Brain Sciences.

[12] E. Gibson. Linguistic complexity: locality of syntactic dependencies, 1998, Cognition.

[13] Richard Futrell, et al. Large-scale evidence of dependency length minimization in 37 languages, 2015, Proceedings of the National Academy of Sciences.

[14] Bruno Guillaume, et al. SUD or Surface-Syntactic Universal Dependencies: An annotation scheme near-isomorphic to UD, 2018, UDW@EMNLP.

[15] J. Hawkins. Efficiency and complexity in grammars, 2004.

[16] Haitao Liu, et al. Dependency Distance as a Metric of Language Comprehension Difficulty, 2008.

[17] Haitao Liu, et al. The effects of genre on dependency distance and dependency direction, 2017.

[18] J. V. Dam. Syntax der Deutschen Sprache, 1972.

[19] Haitao Liu, et al. Quantitative typological analysis of Romance languages, 2012.

[20] Yuen Ren Chao, et al. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, 1950.

[21] Timothy Osborne, et al. A Historical Overview of the Status of Function Words in Dependency Grammar, 2015, DepLing.

[22] D. Adger, et al. Syntax, 2014, Wiley interdisciplinary reviews. Cognitive science.

[23] P. Lachenbruch. Statistical Power Analysis for the Behavioral Sciences (2nd ed.), 1989.

[24] Evelina Fedorenko, et al. The syntactic complexity of Russian relative clauses, 2012, Journal of memory and language.

[25] Haitao Liu, et al. Dependency direction as a means of word-order typology: A method based on dependency treebanks, 2010.

[26] Ivan A. Sag, et al. Book Reviews: Head-driven Phrase Structure Grammar and German in Head-driven Phrase-structure Grammar, 1996, CL.

[27] Treebanks Treebanks Building and Using Parsed Corpora, 2011.

[28] Haitao Liu, et al. Dependency distance: A new perspective on syntactic patterns in natural languages, 2017, Physics of life reviews.

[29] Daniel Gildea, et al. Do Grammars Minimize Dependency Length?, 2010, Cogn. Sci.

[30] Ray Jackendoff, et al. Semantic Interpretation in Generative Grammar, 1972.

[31] Haitao Liu, et al. Can chunking reduce syntactic complexity of natural languages?, 2016, Complex.

[32] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences, 1969, The SAGE Encyclopedia of Research Design.

[33] J. Nichols. Head-marking and dependent-marking grammar, 1986.

[34] Christopher D. Manning, et al. Stanford typed dependencies manual, 2010.

[35] Wolfgang Sternefeld, et al. Syntax: An International Handbook of Contemporary Research, 1993.

[36] Lucien Tesnière. Éléments de syntaxe structurale, 1959.

[37] Timothy Osborne, et al. The status of function words in dependency grammar: A critique of Universal Dependencies (UD), 2019, Glossa: a journal of general linguistics.

[38] Haitao Liu, et al. The effects of sentence length on dependency distance, dependency direction and the implications–Based on a parallel English–Chinese dependency treebank, 2015.